jorisvandenbossche commented on issue #35531: URL: https://github.com/apache/arrow/issues/35531#issuecomment-1561032214
> Yes, I don't think we have to recommend a consumer API. But we'll have to choose one for ourselves ;-)

Indeed. And for pyarrow, it could also be something like `RecordBatchReader.from_arrow_stream` (or `from_arrow_c_stream`, or another name), and similarly for other objects, to keep it consistent with the existing `from_` methods.

> > > [Do we want to distinguish between an array and a tabular version? ...] It could be nice to distinguish those use cases for consumers.
> >
> > I'm not sure that's useful. @lidavidm Thoughts?
>
> I'm also not sure it's useful, but it seems we could define `__arrow_c_array__` after the fact if we find a use case.

To clarify this part a bit, let's assume we are talking about the ArrowArray version (not the stream) to keep it simple. Currently, a pyarrow.Array can be exported to an ArrowArray, and so can a pyarrow.RecordBatch (but in the second case, you know you always have a struct type). The C Interface itself doesn't distinguish between the two (and that's fine), but in practice the interface is used for both "types" of data (array vs tabular).

And for a _consumer_, I can imagine it would be useful to distinguish. For example, assume that pandas has a function to construct a pandas.DataFrame from any object that supports this protocol. In that case, pandas might only be interested in data that logically represents tabular data, and not an array (because then you don't have column names, might have top-level nulls, etc).

If there is only a single `__arrow_c_array__`, pandas could of course check whether the data it received matches the requirements for tabular data (i.e. it is a struct array and has no validity bitmap). But if there were two protocol methods (e.g. `__arrow_c_array__` and `__arrow_c_batch__`), it could accept only objects that define the second method (and thereby declare themselves as tabular data).

> Did you envision that `__arrow_c_stream__()` could return things that are not tables?

They certainly can and do outside pyarrow (I believe Rust2 supports it...nanoarrow in R does too). It's a fairly useful representation of a ChunkedArray since there's no other officially ABIified way to do that.

Yes, it currently essentially returns an array, not a table. We just _mostly_ use it for tables in practice.

As a concrete example: in the arrow-rs implementation, the RecordBatch conversion to/from pyarrow actually iterates over the fields and converts them one by one, using the C interface on each array, instead of making a single C interface call with a struct array for the full RecordBatch (https://github.com/apache/arrow-rs/blob/3adca539ad9e1b27892a5ef38ac2780aff4c0bff/arrow/src/pyarrow.rs#L167-L204)
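
To make the two consumer-side options from the array-vs-tabular discussion a bit more concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: `ExportedArray` is a hypothetical stand-in for the exported ArrowSchema/ArrowArray pair (not the real C structs), the producer classes and helper functions are made up, `__arrow_c_batch__` is only a name floated in this thread, and a zero top-level null count is used as a simplified proxy for "no validity bitmap". The only real detail is that `+s` is the C Data Interface format string for a struct.

```python
from dataclasses import dataclass


@dataclass
class ExportedArray:
    """Hypothetical stand-in for an exported ArrowSchema/ArrowArray pair."""
    format: str      # Arrow C format string; "+s" denotes a struct
    null_count: int  # top-level null count (proxy for a validity bitmap)


class ArrayProducer:
    """Made-up producer that only declares itself as a plain array."""

    def __arrow_c_array__(self):
        return ExportedArray(format="l", null_count=0)  # "l" is int64


class BatchProducer:
    """Made-up producer that also declares itself as tabular."""

    def __arrow_c_array__(self):
        return ExportedArray(format="+s", null_count=0)

    def __arrow_c_batch__(self):  # hypothetical second protocol method
        return self.__arrow_c_array__()


def accept_tabular_single_method(obj):
    """Option 1: one protocol method, the consumer inspects the data."""
    exported = obj.__arrow_c_array__()
    if exported.format != "+s" or exported.null_count != 0:
        raise TypeError("expected a struct array without top-level nulls")
    return exported


def accept_tabular_two_methods(obj):
    """Option 2: only accept objects that declare themselves tabular."""
    if not hasattr(obj, "__arrow_c_batch__"):
        raise TypeError("object does not declare itself as tabular")
    return obj.__arrow_c_batch__()
```

With option 1 the consumer has to export the data before it can reject it; with option 2 it can reject a non-tabular object up front with a simple attribute check.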
