paleolimbot commented on issue #40648:
URL: https://github.com/apache/arrow/issues/40648#issuecomment-2005564663

   > nanoarrow implements both `__arrow_c_array__` and `__arrow_c_stream__`
   
   For reference, the PR implementing the `nanoarrow.Array` is 
https://github.com/apache/arrow-nanoarrow/pull/396 . It is basically a 
ChunkedArray and is currently the only planned user-facing Arrayish thing, 
although it's all very new (feel free to comment on that PR!). Basically, I 
found that maintaining both a chunked and a non-chunked pathway in 
geoarrow-pyarrow resulted in a lot of Python loops over chunks and I wanted to 
avoid forcing nanoarrow users to maintain two pathways. Many pyarrow methods 
might give you back either an `Array` or a `ChunkedArray`; however, many
`ChunkedArray`s only have one chunk. The whole thing is imperfect and a bit of 
a compromise.
   
   > Fundamentally, my question is whether the existence of methods on an 
object should allow for an inference of its storage type
   
   My take on this is that as long as the object has an unambiguous 
interpretation as a contiguous array (or *might* have one, since it might take 
a loop over something that is not already Arrow data to figure this out), I 
think it's fine for `__arrow_c_array__` to exist. As long as an object has an
unambiguous interpretation as zero or more arrays (or *might* have one), I 
think `__arrow_c_stream__` can exist. I don't see those as mutually 
exclusive...for me this is like `pyarrow.array()` returning either a 
`ChunkedArray` or an `Array`: it just doesn't know until it sees the input what 
type it needs to unambiguously represent it.
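   To make that concrete, here is a minimal sketch (not nanoarrow's actual implementation; the class and its wrapping strategy are hypothetical) of an object with an unambiguous interpretation both ways, assuming a pyarrow recent enough (14.0+) to export the PyCapsule interface on `Array` and `ChunkedArray`:

```python
import pyarrow as pa


class MaybeChunked:
    """Hypothetical holder of data that may arrive as zero or more chunks."""

    def __init__(self, chunks):
        self._chunked = pa.chunked_array(chunks)

    def __arrow_c_stream__(self, requested_schema=None):
        # Unambiguous as "zero or more arrays": delegate to the ChunkedArray.
        return self._chunked.__arrow_c_stream__(requested_schema)

    def __arrow_c_array__(self, requested_schema=None):
        # Also unambiguous as one contiguous array, but producing it may
        # require concatenating chunks first (i.e., this call can be costly).
        contiguous = pa.concat_arrays(list(self._chunked.chunks))
        return contiguous.__arrow_c_array__(requested_schema)
```

   A stream-aware consumer calls `__arrow_c_stream__` and never pays for the concatenation; a consumer that needs contiguous memory calls `__arrow_c_array__` and does.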
   
   For something like an `Array` or `RecordBatch` (or something like a `numpy` 
array) that is definitely Arrow and definitely contiguous, I am not sure what 
the benefit of also exposing `__arrow_c_stream__` would be, and it is probably 
just confusing if it exists.
   
   There are other assumptions that can't be captured by the mere existence of 
either of those, like exactly how expensive it will be to call any one of those 
methods. In https://github.com/shapely/shapely/pull/1953 both are fairly 
expensive because the data are not Arrow yet. For a database driver, it might 
be expensive to consume the stream because the data haven't arrived over the 
network yet.
   
   The Python buffer protocol has a `flags` field to handle consumer requests 
along these lines (like a request for contiguous, rather than strided, memory) 
that could be used to disambiguate some of these cases if it turns out that 
disambiguating them is important. It is also careful to note that the existence 
of the buffer protocol implementation does not imply that attempting to get the 
buffer will succeed.
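   For illustration only (this is the buffer protocol, not the Arrow PyCapsule interface), a pure-Python sketch of that `flags` handshake is possible on Python 3.12+ via PEP 688; the class here is hypothetical:

```python
import inspect


class ReadOnlyExporter:
    """Hypothetical exporter: it advertises the buffer protocol, but a given
    request can still fail with BufferError (e.g. a writable-memory request)."""

    def __init__(self, data: bytes):
        self._data = data

    def __buffer__(self, flags: int) -> memoryview:
        # `flags` carries the consumer's request (writable? contiguous? strided?).
        if flags & inspect.BufferFlags.WRITABLE:
            raise BufferError("this exporter only hands out read-only memory")
        return memoryview(self._data)


# memoryview() issues a read-only request, so this succeeds:
mv = memoryview(ReadOnlyExporter(b"abc"))
```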
   
   For consuming in nanoarrow, the current approach is to use 
`__arrow_c_stream__` whenever possible since this has the fewest constraints 
(arrays need not be in memory yet, need not be contiguous, might not be fully 
consumed). Then it falls back on `__arrow_c_array__`. The entrypoint is 
`nanoarrow.c_array_stream()`, which will happily accept either (generating a 
length-one stream if needed).
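
   A short usage sketch of that entrypoint (assuming a pyarrow version where `pa.Array` exposes only `__arrow_c_array__` while `pa.ChunkedArray` exposes `__arrow_c_stream__`):

```python
import nanoarrow as na
import pyarrow as pa

# Consumed via __arrow_c_stream__: chunks pass through without concatenation.
from_chunked = na.c_array_stream(pa.chunked_array([[1, 2], [3]]))

# Only __arrow_c_array__ is available here, so the contiguous array is
# wrapped in a length-one stream.
from_array = na.c_array_stream(pa.array([1, 2, 3]))
```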

