[GitHub] [arrow] jorisvandenbossche commented on issue #35531: [Python] Add Python protocol for the Arrow C (Data/Stream) Interface

via GitHub Wed, 24 May 2023 05:28:37 -0700


jorisvandenbossche commented on issue #35531:
URL: https://github.com/apache/arrow/issues/35531#issuecomment-1561032214


   > Yes, I don't think we have to recommend a consumer API. But we'll have to 
choose one for ourselves ;-)
   
   Indeed. And for pyarrow, it could also be something like 
`RecordBatchReader.from_arrow_stream` (or `from_arrow_c_stream`, or another 
name), and similarly for other objects, to keep it consistent with the existing 
`from_` methods.
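
   A minimal sketch of what such a consumer-side constructor could look like, in 
pure Python. Everything here is an assumption for illustration: `ToyReader` 
stands in for a class like `RecordBatchReader`, `ToyStreamProducer` for any 
third-party object, `from_arrow_stream` is just one of the names floated above, 
and a plain iterator stands in for a real ArrowArrayStream handle.

```python
class ToyStreamProducer:
    """Hypothetical third-party object exposing the proposed stream export."""

    def __init__(self, batches):
        self._batches = list(batches)

    def __arrow_c_stream__(self):
        # A real producer would hand out an ArrowArrayStream handle;
        # a plain iterator is used here as a placeholder.
        return iter(self._batches)


class ToyReader:
    """Hypothetical consumer, standing in for something like RecordBatchReader."""

    def __init__(self, stream):
        self._stream = stream

    @classmethod
    def from_arrow_stream(cls, obj):
        # Accept any object that implements the protocol method,
        # regardless of which library it comes from.
        if not hasattr(type(obj), "__arrow_c_stream__"):
            raise TypeError(
                f"{type(obj).__name__} does not implement __arrow_c_stream__"
            )
        return cls(obj.__arrow_c_stream__())

    def read_all(self):
        # Drain the placeholder stream into a list.
        return list(self._stream)


reader = ToyReader.from_arrow_stream(ToyStreamProducer(["batch0", "batch1"]))
batches = reader.read_all()
```

   Naming it as a `from_` classmethod mirrors the existing constructors, so the 
protocol becomes one more input type rather than a separate entry point.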
   
   > > > [Do we want to distinguish between an array and a tabular version? 
...] It could be nice to distinguish those use cases for consumers.
   > > 
   > > I'm not sure that's useful. @lidavidm Thoughts?
   > 
   > I'm also not sure it's useful, but it seems we could define 
`__arrow_c_array__` after the fact if we find a use case.
   
   To clarify this part a bit: to keep it simple, assume we are talking about 
the ArrowArray version (not the stream). Currently, a pyarrow.Array can be 
exported to an ArrowArray, and so can a pyarrow.RecordBatch (but in the second 
case, you know you always have a struct type). 
   The C Interface itself doesn't distinguish between both (and that's fine), 
but in practice the interface is used for both "types" of data (array vs 
tabular). And for a _consumer_, I can imagine it would be useful to 
distinguish. For example, assume that pandas has a function to construct a 
pandas.DataFrame from any object that supports this protocol. In that case, 
pandas might only be interested in data that logically represents tabular data, 
and not an array (because then you don't have column names, might have 
top-level nulls, etc). In case there is only a single `__arrow_c_array__`, 
pandas could of course check if the data it received matches the requirements 
for tabular data (i.e. is a struct array and has no validity bitmap). But if 
there were two protocol methods (e.g. `__arrow_c_array__` and 
`__arrow_c_batch__`), it could accept only objects that define the second 
method (and thereby declare themselves as tabular data).
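
   The consumer-side difference can be sketched roughly as follows. This is 
only an illustration under stated assumptions: `__arrow_c_batch__` is just the 
name floated above, not an existing protocol method; `ToyArray`, `ToyBatch`, 
`declares_tabular` and `tabular_export` are hypothetical names; and the methods 
return placeholder strings rather than real C interface structures.

```python
class ToyArray:
    """Hypothetical producer exposing only the generic array export."""

    def __arrow_c_array__(self):
        return "array-placeholder"


class ToyBatch:
    """Hypothetical producer that also declares itself as tabular data."""

    def __arrow_c_array__(self):
        return "struct-array-placeholder"

    def __arrow_c_batch__(self):
        # Same underlying data, but the method's presence signals "tabular".
        return "struct-array-placeholder"


def declares_tabular(obj):
    # With two protocol methods, the consumer only needs to look for the
    # second one; it does not have to inspect the exported data itself.
    return hasattr(type(obj), "__arrow_c_batch__")


def tabular_export(obj):
    """What a DataFrame constructor might do before importing the data."""
    if not declares_tabular(obj):
        raise TypeError(
            f"{type(obj).__name__} does not declare itself as tabular data"
        )
    return obj.__arrow_c_batch__()


exported = tabular_export(ToyBatch())
```

   With a single `__arrow_c_array__`, the same function would instead have to 
import the data first and then verify that it is a struct array without a 
validity bitmap.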
   
   > Did you envision that `__arrow_c_stream__()` could return things that are 
not tables? They certainly can and do outside pyarrow (I believe Rust2 supports 
it... nanoarrow in R does too). It's a fairly useful representation of a 
ChunkedArray since there's no other officially ABIified way to do that.
   
   Yes, it currently essentially returns an array, not a table. We just 
_mostly_ use it for tables in practice. 
   
   As a concrete example: in the arrow-rs implementation, the RecordBatch 
conversion to/from pyarrow actually iterates over each field to convert field 
by field using the C interface on each array, instead of making a single C 
interface call with a struct array for the full RecordBatch 
(https://github.com/apache/arrow-rs/blob/3adca539ad9e1b27892a5ef38ac2780aff4c0bff/arrow/src/pyarrow.rs#L167-L204)
 
   


