jorisvandenbossche commented on issue #70:
URL: https://github.com/apache/arrow-adbc/issues/70#issuecomment-1825633068

   Started looking at this again, exploring how we can adopt the capsule 
protocol in ADBC python.
   
   The low-level interface (`_lib.pyx`) has no dependency on pyarrow, and 
currently has `ArrowSchemaHandle` / `ArrowArrayHandle` / 
`ArrowArrayStreamHandle` classes that hold the C structs, and those classes are 
returned in the various objects. 
   
   I think in theory we could replace those current Handle classes with 
PyCapsules directly (and in the higher-level code, we can still extract the 
pointer from the PyCapsule when having to deal with a pyarrow version that 
doesn't support capsules directly). 
   However, those handle objects are currently exposed in the public API, so 
would we be OK with just removing them? (I don't know if there are external 
libraries that use those directly?) 
   We could also keep them, add add the dunder methods to them so they are 
importable without having to access the integer address. Which is what David 
mentioned above 
(https://github.com/apache/arrow-adbc/issues/70#issuecomment-1790708865), I 
assume:
   
   > I guess that means I'd keep the existing objects we have (that expose the 
raw address), and just have them implement the new dunder methods in addition
   
   One advantage of keeping some generic wrapper/handle class that has the 
dunders vs returning pycapsules directly, is that the return values of the 
low-level interface can then more easily be passed to a library that expects an 
object with the dunder defined instead of the capsules directly (i.e. how we 
currently implemented support for this in pyarrow, e.g. `pyarrow.array(..)` 
checks for `__arrow_c_array__` attribute on the passed object, but doesn't 
accept capsules directly, xref https://github.com/apache/arrow/issues/38010)
   
   ---
   
   The DBAPI is much more tied to pyarrow, so I don't know if, on the short 
term, we want to enable getting arrow data without a dependency on `pyarrow` 
and only relying on the capsule protocol. That would require quite some changes.
   
   Just getting an overview for myself:
   
   - `Connection.adbc_get_info` and `adbc_get_table_types` -> returns info as a 
dict or list, but under the hood consumes the stream with pyarrow and convert 
to pylist -> this doesn't expose pyarrow data, so on the short term this could 
continue to require pyarrow as a runtime dependency
   - `Connection.adbc_get_objects` returns a pyarrow.RecordBatchReader
   - `Connection.adbc_get_table_schema` returns a pyarrow.Schema
   - `Cursor._bind` / `_prepare_execute` / `execute` -> currently accepts 
pyarrow RecordBatch/Table as `parameters`, but this can be expanded to any 
object that has the array or array stream protocol
   - `Cursor.execute` / `adbc_read_partition` sets a `_result` object to 
`_RowIterator`of a `AdbcRecordBatchReader`. This reader subclasses 
`pyarrow.RecordBatchReader` (to ensure adbc error messages are properly 
propagated)
   - `Cursor.adbc_ingest` accepts pyarrow RecordBatch, Table or 
RecordBatchReader as `data`, this can also be generalized to any object that 
supports the protocol
   - `Cursor.adbc_execute_schema` / `adbc_prepare` returns a pyarrow.Schema
   - `Cursor.fetchallarrow` / `fetch_arrow_table` return a pyarrow.Table
   - `Cursor.fetch_record_batch` returns a pyarrow.RecordBatchReader
   - `Cursor.fetchone` / `fetchmany` / `fetchall` use the `_RowIterator`, which 
uses the pyarrow RecordBatchReader to iterate over the data -> this only uses 
pyarrow under the hood for converting the data to Python tuples, and so on the 
short term can continue to do that with a pyarrow runtime dependency.
   
   Changing the methods that currently return a pyarrow 
Schema/Table/RecordBatchReader to return a generic object implementing the 
dunders seems too much of a breaking change (and would also be much less 
convenient for pyarrow users, e.g. requiring a users to do 
`pyarrow.schema(conn.get_table_schema())` instead of 
`conn.get_table_schema()`). 
   But would we want to add _additional_ method variants of each of those that 
return something generic like that?
   
   Of course, assuming you have pyarrow 14+ installed, by returning pyarrow 
objects which implement the dunders, we of course also automatically make the 
capsule protocol available in ADBC. So it's more a question to what extent we 
want to make it possible to use the DBAPI layer without pyarrow.
   
   ---
   
   So in summary, short term I think I would do:
   
   - Add the dunder methods to the handle classes of the low-level interface, 
which already enables using the low-level interface without pyarrow and with 
the capsule protocol
   - In the places that accept data (eg ingest), generalize to accept objects 
that implement the dunders in addition to hardcoded support for pyarrow
   
   And then longer term we can think about whether we also want to enable using 
the DBAPI in _some form_ without a runtime dependency on pyarrow.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to