jonded94 opened a new pull request, #8790: URL: https://github.com/apache/arrow-rs/pull/8790
# Rationale for this change

When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, returning a single `RecordBatch` can fail because of index overflows (https://github.com/apache/arrow-rs/issues/7973). In `pyarrow` this is usually solved by representing the data as a `pyarrow.Table`, whose columns are `ChunkedArray`s, which are essentially just lists of Arrow arrays; equivalently, a `pyarrow.Table` can be viewed as a representation of a list of `RecordBatch`es.

I'd like to build a function in PyO3 that returns a `pyarrow.Table`, very similar to [pyarrow's read_row_group method](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.read_row_group). Currently, as far as I can see, there is no way in `arrow-pyarrow` to export a `pyarrow.Table` directly; in particular, convenience methods for `Vec<RecordBatch>` seem to be missing. This PR implements a convenience wrapper that allows exporting a `pyarrow.Table` directly.

# What changes are included in this PR?

A new struct `Table` is added to the `arrow-pyarrow` crate. It can be constructed from `Vec<RecordBatch>` or from an `ArrowArrayStreamReader`, and it implements `FromPyArrow` and `IntoPyArrow`.

`FromPyArrow` accepts any Python object that implements the Arrow stream-reader protocol, is a `RecordBatchReader`, or has a `to_reader()` method returning one; `pyarrow.Table` satisfies this. `IntoPyArrow` produces a `pyarrow.Table` on the Python side, constructed through `pyarrow.Table.from_batches(...)`. (A sketch of the intended downstream usage follows at the end of this description.)

# Are these changes tested?

No, not yet. Please let me know whether you are in general fine with this PR; then I'll work on tests. So far I have only tested it locally with very simple PyO3 dummy functions that do a round-trip, and everything worked:

```
#[pyfunction]
pub fn roundtrip_table(table: PyArrowType<Table>) -> PyArrowType<Table> {
    table
}

#[pyfunction]
pub fn build_table(record_batches: Vec<PyArrowType<RecordBatch>>) -> PyArrowType<Table> {
    PyArrowType(Table::try_new(record_batches.into_iter().map(|rb| rb.0).collect()).unwrap())
}
```

=>

```
>>> import pyo3parquet
>>> import pyarrow
>>> table = pyarrow.Table.from_pylist([{"foo": 1}])
>>> pyo3parquet.roundtrip_table(table)
pyarrow.Table
foo: int64
----
foo: [[1]]
>>> pyo3parquet.build_table(table.to_batches())
pyarrow.Table
foo: int64
----
foo: [[1]]
```

The real tests would of course be more sophisticated than this.

# Are there any user-facing changes?

A new `Table` convenience wrapper is added!
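For reviewers, here is a minimal, untested sketch of the kind of downstream code this wrapper is meant to enable: a PyO3 function that reads one Parquet row group as several `RecordBatch`es and hands them back to Python as a single `pyarrow.Table`. The function name `read_row_group_as_table`, the batch size, the import path `arrow_pyarrow::Table`, and the error handling are assumptions for illustration only; `Table::try_new` and `PyArrowType` refer to this PR, and the reader calls are the existing `ParquetRecordBatchReaderBuilder` API from the `parquet` crate.

```
use std::fs::File;

use arrow_pyarrow::PyArrowType; // existing newtype wrapper
use arrow_pyarrow::Table;       // the wrapper proposed in this PR (assumed path)
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Hypothetical helper: read one row group of a Parquet file and return it to
/// Python as a `pyarrow.Table` made of several batches, so that large
/// Binary/Utf8 columns are split across arrays instead of overflowing offsets.
#[pyfunction]
pub fn read_row_group_as_table(path: &str, row_group: usize) -> PyResult<PyArrowType<Table>> {
    let file = File::open(path).map_err(|e| PyValueError::new_err(e.to_string()))?;

    // Restrict the reader to a single row group and cap the batch size
    // (8192 rows here is an arbitrary illustrative choice).
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)
        .and_then(|b| b.with_row_groups(vec![row_group]).with_batch_size(8192).build())
        .map_err(|e| PyValueError::new_err(e.to_string()))?;

    // Collect all batches of the row group.
    let batches = reader
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| PyValueError::new_err(e.to_string()))?;

    // `Table::try_new` is the constructor proposed in this PR.
    let table = Table::try_new(batches).map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(PyArrowType(table))
}
```

On the Python side this would come back as an ordinary `pyarrow.Table`, which is exactly the shape that `ParquetFile.read_row_group` produces in `pyarrow` itself.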
