kylebarron commented on issue #5295: URL: https://github.com/apache/arrow-rs/issues/5295#issuecomment-2401071532
I think it's fair that arrow-rs doesn't implement `ChunkedArray` (its surface area is big enough already), but I would like to point out that its omission leaves some use cases unserved. The most glaring is that it is **not currently possible** to use arrow-rs's FFI to exchange something like a `ChunkedArray` when those arrays do not represent RecordBatches. [`ffi_stream::ArrowArrayStreamReader`](https://docs.rs/arrow/latest/arrow/ffi_stream/struct.ArrowArrayStreamReader.html) exists but will error if the data type of the stream is not `Struct`. This makes it impossible in the general case to interop with a `pyarrow.ChunkedArray` or `polars.Series` (via Python).

In pyo3-arrow I have an [`ArrayReader`](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/ffi/trait.ArrayReader.html) trait to parallel `arrow::RecordBatchReader`, and I [vendored a derived copy of `ffi_stream.rs`](https://github.com/kylebarron/arro3/blob/0829e34fe250314c2e068ff86e3c5e7ad003d607/pyo3-arrow/src/ffi/from_python/ffi_stream.rs) to make it possible to handle this interop (while not necessarily materializing the entire stream as a `ChunkedArray`).

> > impl Stream<Item=ArrayRef> a lazy async version of a ChunkedArray - this is what DataFusion uses extensively
>
> In case anyone wants details, this is called `RecordBatchStream`:

IMO it's an important distinction that `RecordBatchStream` really _isn't_ an `impl Stream<Item=ArrayRef>`; it's an `impl Stream<Item=RecordBatch>`.
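
To make the shape of the idea concrete, here is a minimal sketch of a reader that yields single-column array chunks together with the field they share, in the same way a record batch reader yields batches together with a schema. It uses only the standard library: `DataType`, `Field`, `ArrayRef`, `ReaderError`, `ChunkIterator`, and `sum_stream` are all hypothetical stand-ins invented for this example, not the arrow-rs or pyo3-arrow definitions, and the trait here only loosely mirrors the `ArrayReader` idea described above.

```rust
use std::sync::Arc;

// Hypothetical stand-ins so the sketch compiles without the arrow crate.
// Real code would use arrow's own data type, field, array and error types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

type ArrayRef = Arc<Vec<i64>>; // stand-in for a reference-counted array
type ReaderError = String; // stand-in for an Arrow error type

/// An iterator of array chunks plus the single field that describes them.
/// Unlike a record batch reader, the chunks need not be struct-typed.
trait ArrayReader: Iterator<Item = Result<ArrayRef, ReaderError>> {
    fn field(&self) -> Arc<Field>;
}

/// Hypothetical reader over chunks that are already in memory.
struct ChunkIterator {
    field: Arc<Field>,
    chunks: std::vec::IntoIter<ArrayRef>,
}

impl ChunkIterator {
    fn new(name: &str, data_type: DataType, chunks: Vec<Vec<i64>>) -> Self {
        let field = Arc::new(Field { name: name.to_string(), data_type });
        let chunks: Vec<ArrayRef> = chunks.into_iter().map(Arc::new).collect();
        Self { field, chunks: chunks.into_iter() }
    }
}

impl Iterator for ChunkIterator {
    type Item = Result<ArrayRef, ReaderError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.chunks.next().map(Ok)
    }
}

impl ArrayReader for ChunkIterator {
    fn field(&self) -> Arc<Field> {
        Arc::clone(&self.field)
    }
}

/// Consumes the stream one chunk at a time, so the full sequence of chunks
/// is never collected into a single chunked container.
fn sum_stream(reader: impl ArrayReader) -> Result<i64, ReaderError> {
    let field = reader.field();
    if field.data_type != DataType::Int64 {
        return Err(format!(
            "field '{}' is {:?}, expected Int64",
            field.name, field.data_type
        ));
    }
    let mut total = 0i64;
    for chunk in reader {
        total += chunk?.iter().sum::<i64>();
    }
    Ok(total)
}
```

A consumer written against the trait, such as `sum_stream`, checks the field once up front and then processes each chunk as it arrives. Calling it on a `ChunkIterator` built from the chunks `[1, 2]` and `[3]` with an `Int64` field returns `Ok(6)`.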
