jonded94 opened a new pull request, #8790:
URL: https://github.com/apache/arrow-rs/pull/8790

   # Rationale for this change
   
   When dealing with Parquet files that contain an exceedingly large amount of 
Binary or UTF8 data in a single row group, returning the data as one 
`RecordBatch` can fail due to index overflows 
(https://github.com/apache/arrow-rs/issues/7973). 
   
   In `pyarrow` this is usually solved by representing the data as a 
`pyarrow.Table` object whose columns are `ChunkedArray`s, which are essentially 
just lists of Arrow arrays; equivalently, a `pyarrow.Table` can be viewed as a 
representation of a list of `RecordBatch`es.
   
   I'd like to build a function in PyO3 that returns a `pyarrow.Table`, very 
similar to [pyarrow's read_row_group 
method](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.read_row_group).
   Currently, as far as I can see, there is no way in `arrow-pyarrow` to export 
a `pyarrow.Table` directly; in particular, convenience methods for converting 
from a `Vec<RecordBatch>` seem to be missing. This PR implements a convenience 
wrapper that allows exporting a `pyarrow.Table` directly.
   
   # What changes are included in this PR?
   
   A new struct `Table` is added to the `arrow-pyarrow` crate. It can be 
constructed from a `Vec<RecordBatch>` or from an `ArrowArrayStreamReader`, 
and it implements `FromPyArrow` and `IntoPyArrow`. 
   
   `FromPyArrow` will support anything that implements the Arrow stream 
protocol, is a `RecordBatchReader`, or has a `to_reader()` method returning 
one. `pyarrow.Table` satisfies both of the latter.
   `IntoPyArrow` will result in a `pyarrow.Table` on the Python side, 
constructed through `pyarrow.Table.from_batches(...)`.
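   For illustration, the `FromPyArrow` lookup order described above can be sketched in Python. This is a hypothetical helper, not the actual `arrow-pyarrow` implementation; the `resolve_stream_source` name and the `read_next_batch` duck-typing check are assumptions made for this sketch:

   ```python
   # Hypothetical sketch of the conversion order described above:
   # 1. objects exposing the Arrow C stream protocol are used directly,
   # 2. reader-like objects (e.g. a RecordBatchReader) are used directly,
   # 3. otherwise fall back to calling to_reader(), as pyarrow.Table offers.
   def resolve_stream_source(obj):
       """Return the object that should be exported as an Arrow stream."""
       if hasattr(obj, "__arrow_c_stream__"):
           # Speaks the Arrow PyCapsule stream protocol directly.
           return obj
       if hasattr(obj, "read_next_batch"):
           # Looks like a RecordBatchReader-style object (assumed check).
           return obj
       if hasattr(obj, "to_reader"):
           # e.g. pyarrow.Table: convert to a RecordBatchReader first.
           return obj.to_reader()
       raise TypeError(f"cannot convert {type(obj).__name__} to an Arrow stream")


   class FakeTable:
       """Stand-in for pyarrow.Table that only offers to_reader()."""
       def to_reader(self):
           return "reader"


   print(resolve_stream_source(FakeTable()))  # prints "reader"
   ```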
   
   # Are these changes tested?
   
   No, not yet. Please let me know whether you are in general fine with this 
PR, and then I'll work on tests. So far I have only tested it locally with 
very simple PyO3 dummy functions doing a round-trip, and everything worked:
   
   ```rust
   #[pyfunction]
   pub fn roundtrip_table(table: PyArrowType<Table>) -> PyArrowType<Table> {
       table
   }

   #[pyfunction]
   pub fn build_table(record_batches: Vec<PyArrowType<RecordBatch>>) -> PyArrowType<Table> {
       PyArrowType(Table::try_new(record_batches.into_iter().map(|rb| rb.0).collect()).unwrap())
   }
   ```
   =>
   ```python
   >>> import pyo3parquet
   >>> import pyarrow
   >>> table = pyarrow.Table.from_pylist([{"foo": 1}])
   >>> pyo3parquet.roundtrip_table(table)
   pyarrow.Table
   foo: int64
   ----
   foo: [[1]]
   >>> pyo3parquet.build_table(table.to_batches())
   pyarrow.Table
   foo: int64
   ----
   foo: [[1]]
   ```
   The real tests would of course be much more sophisticated than this.
   
   # Are there any user-facing changes?
   
   A new `Table` convenience wrapper is added!

