kylebarron commented on issue #5295:
URL: https://github.com/apache/arrow-rs/issues/5295#issuecomment-2401071532

   I think it's fair that arrow-rs doesn't implement ChunkedArray — its surface 
area is big enough already — but I would like to point out that there are some 
use cases that its omission leaves unserved.
   
   The most glaring is that it is **not currently possible** to use arrow-rs's 
FFI to exchange something like a `ChunkedArray` when those arrays do not 
represent RecordBatches. 
[`ffi_stream::ArrowArrayStreamReader`](https://docs.rs/arrow/latest/arrow/ffi_stream/struct.ArrowArrayStreamReader.html)
 exists but will error if the data type of the stream is not `Struct`.
   
   This makes it impossible in the general case to interop with a 
`pyarrow.ChunkedArray` or `polars.Series` (via Python).
   
   In pyo3-arrow I have an 
[`ArrayReader`](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/ffi/trait.ArrayReader.html)
 trait to parallel `arrow::RecordBatchReader`, and [vendored a derived copy of 
`ffi_stream.rs`](https://github.com/kylebarron/arro3/blob/0829e34fe250314c2e068ff86e3c5e7ad003d607/pyo3-arrow/src/ffi/from_python/ffi_stream.rs)
 to make it possible to handle this interop (while not necessarily 
materializing the entire stream as a `ChunkedArray`).
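   
   As a rough sketch of the shape such a trait takes (this is not the actual pyo3-arrow source): the reader is an iterator of array chunks plus a single field describing them, mirroring how `RecordBatchReader` pairs an iterator of batches with a schema. The `ArrayRef`, `Field`, and `ArrowError` types below are simplified stand-ins so the example compiles without the arrow crate, and `VecArrayReader` and `total_len` are hypothetical names used only for illustration.

```rust
use std::sync::Arc;

// Stand-ins for arrow types so the sketch compiles with only the standard
// library. In real code these would be arrow's `ArrayRef`, `FieldRef`, and
// `ArrowError`; the shapes here are assumptions made for illustration.
type ArrayRef = Arc<Vec<i64>>;
type FieldRef = Arc<Field>;

#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

#[derive(Debug)]
struct ArrowError(String);

/// Array-level analogue of `RecordBatchReader`: yields one array chunk at a
/// time and describes all chunks with a single field instead of a schema.
trait ArrayReader: Iterator<Item = Result<ArrayRef, ArrowError>> {
    fn field(&self) -> FieldRef;
}

/// Hypothetical in-memory reader over pre-built chunks.
struct VecArrayReader {
    field: FieldRef,
    chunks: std::vec::IntoIter<ArrayRef>,
}

impl VecArrayReader {
    fn new(field: Field, chunks: Vec<ArrayRef>) -> Self {
        Self {
            field: Arc::new(field),
            chunks: chunks.into_iter(),
        }
    }
}

impl Iterator for VecArrayReader {
    type Item = Result<ArrayRef, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.chunks.next().map(Ok)
    }
}

impl ArrayReader for VecArrayReader {
    fn field(&self) -> FieldRef {
        Arc::clone(&self.field)
    }
}

/// Consume the reader chunk by chunk, never collecting the chunks into a
/// single `ChunkedArray`-like container.
fn total_len(reader: impl ArrayReader) -> Result<usize, ArrowError> {
    let mut n = 0;
    for chunk in reader {
        n += chunk?.len();
    }
    Ok(n)
}
```

   The point of the sketch is that a consumer such as `total_len` only ever sees one chunk at a time, and the element type is described by a field of any data type rather than by a struct-typed schema.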
   
   > > impl Stream<Item=ArrayRef> a lazy async version of a ChunkedArray - this is what DataFusion uses extensively
   > 
   > In case anyone wants details, this is called `RecordBatchStream`:
   
   It's IMO an important distinction that `RecordBatchStream` really _isn't_ an 
`impl Stream<Item=ArrayRef>`; it's an `impl Stream<Item=RecordBatch>`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.