jorisvandenbossche opened a new issue, #35531:
URL: https://github.com/apache/arrow/issues/35531
**Context**: we want Arrow to be usable as the format for sharing data
between (Python) libraries/applications, ideally in a generic way that doesn't
require hardcoding for specific libraries.
We already have `__arrow_array__` for objects that know how to convert
themselves to a `pyarrow.Array` or ChunkedArray. But this protocol is for actual
*py*arrow objects (so a better name might have been `__pyarrow_array__` ...),
and is thus tied to the pyarrow library (and also only covers arrays, not
tables/batches). For projects that have an (optional) dependency on pyarrow,
that is fine, but we want to avoid making this a requirement (e.g. for nanoarrow).
However, we also have the Arrow C Data Interface as a more generic way to share
Arrow data in-memory focusing on the actual Arrow spec without relying on a
specific library implementation.
Right now, the way to use the C Interface is through the `_export_to_c` and
`_import_from_c` methods.
But those methods are 1) private, advanced APIs (although we can of course
decide to make them "official", since many projects are already using them, and
document them that way), and 2) again specific to pyarrow (I don't think other
projects have adopted the same names).
So other projects (polars, datafusion, duckdb, etc.) _use_ those to convert
from pyarrow to their own representation. But those projects don't have a
similar API for using the C Data Interface to share their data with another
library (e.g. with pyarrow, or polars with duckdb, ...).
If we had a standard Python protocol (dunder) method for this,
libraries could implement support for consuming (and producing) objects that
expose their data through the Arrow C Interface without having to hardcode for
specific implementations (as those libraries currently do for pyarrow).
The most generic protocol would be one supporting the Stream interface, and
that could look something like this:
```python
class MyArrowCompatibleObject:

    # "PyCapsule" is quoted so the sketch runs without importing a capsule type
    def __arrow_c_stream__(self) -> "PyCapsule":
        """
        Return a PyCapsule wrapping an ArrowArrayStream struct.
        """
        ...
```
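
To make the mechanics a bit more concrete, here is a minimal sketch of how a producer could build such a capsule and how a consumer could get the struct pointer back, using only `ctypes` from the standard library. Several things here are assumptions for illustration and not part of the proposal: the capsule name `arrow_array_stream`, the helper `stream_address`, and the fact that the struct is left zero-initialised (which the C Data Interface treats as already released, so no real data is shared).

```python
import ctypes

# PyCapsule_New / PyCapsule_GetPointer belong to the CPython C API;
# ctypes.pythonapi lets us call them from pure Python.
_capsule_new = ctypes.pythonapi.PyCapsule_New
_capsule_new.restype = ctypes.py_object
_capsule_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_capsule_get = ctypes.pythonapi.PyCapsule_GetPointer
_capsule_get.restype = ctypes.c_void_p
_capsule_get.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Assumed capsule name. Kept at module level because the capsule stores
# the name pointer without copying it.
STREAM_CAPSULE_NAME = b"arrow_array_stream"


class ArrowArrayStream(ctypes.Structure):
    # Same five members as the C struct; the callbacks are typed as plain
    # pointers because this sketch never calls them.
    _fields_ = [
        ("get_schema", ctypes.c_void_p),
        ("get_next", ctypes.c_void_p),
        ("get_last_error", ctypes.c_void_p),
        ("release", ctypes.c_void_p),
        ("private_data", ctypes.c_void_p),
    ]


class MyArrowCompatibleObject:
    def __init__(self):
        # Zero-initialised placeholder: release == NULL means "released".
        # The capsule only borrows this memory, so the producer must
        # outlive it. A real producer would attach a capsule destructor.
        self._stream = ArrowArrayStream()

    def __arrow_c_stream__(self):
        return _capsule_new(
            ctypes.addressof(self._stream), STREAM_CAPSULE_NAME, None
        )


def stream_address(obj):
    """Hypothetical consumer: needs only the dunder, not the producer's type."""
    capsule = obj.__arrow_c_stream__()
    return _capsule_get(capsule, STREAM_CAPSULE_NAME)
```

The consumer side never checks which library the object comes from; it relies only on the dunder method and the agreed capsule name.
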
And in addition we _could_ have variants that do the same for the other
structs, such as `__arrow_c_data__` or `__arrow_c_array__`, `__arrow_c_schema__`,
...
Some design questions:
* For the mechanics of the method, I would propose to use PyCapsules instead
of raw pointers as described here: https://github.com/apache/arrow/issues/34031
* Which set of protocol methods do we need? Is only a stream version
sufficient (since a single array can always be put in a stream of one array)?
Or would it be useful (and simpler for some applications) to also have an Array
version?
* But what would an array version return exactly? (since it needs to
return both the ArrowArray and the ArrowSchema)
* With the ongoing discussion about generalizing the C Interface to other
devices (https://github.com/apache/arrow/pull/34972), should we focus here on
the current interfaces, or should we directly use the Device versions?
* Do we want to distinguish between an array and a tabular version? From the
C Interface point of view, that's all the same, it's just an ArrowArray. But for
example, we currently define `_export_to_c` on a RecordBatch and
RecordBatchReader, where you _know_ this will always return a StructArray
representation of one batch, vs the same method on Array where it can return an
array of any type. It could be nice to distinguish those use cases for
consumers.
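
As a rough illustration of two of the questions above, the sketch below shows one possible shape for an array version: `__arrow_c_array__` returns a tuple of two capsules (schema, array), and the consumer looks at the schema's format string to tell a struct ("tabular") array from any other type. This is only one option, not a settled design. The capsule names, `ToyProducer`, and `describe` are hypothetical, and the structs are toy placeholders with no buffers and no release callback.

```python
import ctypes

_new = ctypes.pythonapi.PyCapsule_New
_new.restype = ctypes.py_object
_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
_get = ctypes.pythonapi.PyCapsule_GetPointer
_get.restype = ctypes.c_void_p
_get.argtypes = [ctypes.py_object, ctypes.c_char_p]

SCHEMA_NAME = b"arrow_schema"  # assumed capsule name
ARRAY_NAME = b"arrow_array"    # assumed capsule name


class ArrowSchema(ctypes.Structure):
    # Member order follows the C Data Interface; pointers are left untyped.
    _fields_ = [
        ("format", ctypes.c_char_p),
        ("name", ctypes.c_char_p),
        ("metadata", ctypes.c_char_p),
        ("flags", ctypes.c_int64),
        ("n_children", ctypes.c_int64),
        ("children", ctypes.c_void_p),
        ("dictionary", ctypes.c_void_p),
        ("release", ctypes.c_void_p),
        ("private_data", ctypes.c_void_p),
    ]


class ArrowArray(ctypes.Structure):
    _fields_ = [
        ("length", ctypes.c_int64),
        ("null_count", ctypes.c_int64),
        ("offset", ctypes.c_int64),
        ("n_buffers", ctypes.c_int64),
        ("n_children", ctypes.c_int64),
        ("buffers", ctypes.c_void_p),
        ("children", ctypes.c_void_p),
        ("dictionary", ctypes.c_void_p),
        ("release", ctypes.c_void_p),
        ("private_data", ctypes.c_void_p),
    ]


class ToyProducer:
    """Hypothetical producer; holds placeholder structs, not real data."""

    def __init__(self, fmt, length):
        self._schema = ArrowSchema(format=fmt)
        self._array = ArrowArray(length=length)

    def __arrow_c_array__(self):
        # One possible answer to "what would it return": a (schema, array) pair.
        return (
            _new(ctypes.addressof(self._schema), SCHEMA_NAME, None),
            _new(ctypes.addressof(self._array), ARRAY_NAME, None),
        )


def describe(obj):
    """Hypothetical consumer: classify the data by its schema format string."""
    schema_capsule, array_capsule = obj.__arrow_c_array__()
    schema = ArrowSchema.from_address(_get(schema_capsule, SCHEMA_NAME))
    array = ArrowArray.from_address(_get(array_capsule, ARRAY_NAME))
    # "+s" is the C Data Interface format string for a struct type.
    kind = "tabular" if schema.format == b"+s" else "array"
    return kind, array.length
```

With this approach the consumer only learns whether the data is tabular after inspecting the schema, which is exactly the trade-off raised in the last bullet: separate protocol methods would let it know up front.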