jonded94 commented on PR #8790: URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3495859644
Thanks @kylebarron for your very quick review! :heart:

> Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.
>
> I don't know what the stance of maintainers is towards including a `Table` construct for python integration.

Yes, I'm also not too sure about it; that's why I only sketched out a rough implementation without tests so far.

One reason I think this could be nice to have in `arrow-pyarrow` is that the [documentation](https://arrow.apache.org/rust/arrow_pyarrow/index.html) itself mentions that there is no equivalent to `pyarrow.Table` in `arrow-pyarrow`, and that slight workarounds are needed to use one:

> PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't have these same concepts. A chunked table is instead represented with Vec<RecordBatch>. A pyarrow.Table can be imported to Rust by calling [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader) and then importing the reader as a [ArrowArrayStreamReader].

I personally think such a wrapper could be nice, since it simplifies things when you already have a `Vec<RecordBatch>` on the Rust side anyway, or need to handle a `pyarrow.Table` on the Python side and want an easy way to generate one from Rust. The documentation could still state that streaming approaches are generally highly preferred, and that the `pyarrow.Table` convenience wrapper should only be used in cases where users know what they're doing.

### Slightly nicer Python workflow

In our very specific example, we have a Python class with a function such as this one:

```Python
class ParquetFile:
    def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...
```

In the [issue](https://github.com/apache/arrow-rs/issues/7973) I linked, this unfortunately breaks down for a specific Parquet file, since a particular row group isn't expressible as a single `RecordBatch` without changing types somewhere. Either you change the underlying Arrow type from `String` to `LargeString` or `StringView`, or you change the return type from `pyarrow.RecordBatch` to, for example, `Iterator[pyarrow.RecordBatch]` (or `RecordBatchReader` or any other streaming-capable object). The latter comes with some syntactic shortcomings in contexts where you want to apply `.to_pylist()` to whatever `read_row_group(...)` returns:

```Python
rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)

python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
    python_objs = rg.to_pylist()
else:
    python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))
```

With `pyarrow.Table`, there already exists a type that simplifies this a lot on the Python side:

```Python
rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)

python_objs: list[dict[str, Any]] = rg.to_pylist()
```

And just for clarity: we unfortunately *need* the entire row group deserialized as Python objects, because the data ingestion pipelines that consume this expect access to the entire row group in bulk, so streaming approaches are sadly not usable here.

> FWIW if you wanted to look at external crates, [`PyTable` exists](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/struct.PyTable.html) that probably does what you want. (disclosure it's my project).
> That alternatively might give you ideas for how to handle the `Table` here if you still want to do that. (It's a separate crate for [these reasons](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration))

Yes, in general I much prefer `arro3`'s approach of being totally `pyarrow`-agnostic. In our case, unfortunately, we're still pretty hardcoded against `pyarrow` specifics right now and just use `arrow-rs` as a means to reduce memory load compared to reading & writing Parquet datasets with `pyarrow` directly.
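For reference, this is roughly what the documented workaround looks like from the Python side today. This is just a minimal sketch: `rust_ext.read_row_group` stands in for a hypothetical pyo3 function that exports a `RecordBatchReader`, and `rust_ext.process_table` for one that imports an `ArrowArrayStreamReader`; neither exists in this PR.

```Python
import pyarrow as pa

import rust_ext  # hypothetical pyo3 extension module

# Rust -> Python: the Rust side exports a stream of batches; pyarrow collects
# it into a Table (Table columns are chunked, so no type changes are needed).
reader: pa.RecordBatchReader = rust_ext.read_row_group(0)
table: pa.Table = reader.read_all()

# Python -> Rust: per the arrow-pyarrow docs, a pyarrow.Table is imported by
# first converting it to a reader, which Rust consumes as an ArrowArrayStreamReader.
rust_ext.process_table(table.to_reader())
```

With a `Table` wrapper in `arrow-pyarrow`, the `read_all()` step on the Python side would disappear, since the Rust side could hand over a `pyarrow.Table` directly.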

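And to spell out why the linked issue can't be solved with a single `RecordBatch`: a `RecordBatch` column is one contiguous array, and `pa.string()` uses `int32` offsets, which caps a single array at roughly 2 GiB of string data, whereas a `pyarrow.Table` column is a `ChunkedArray`, so the same data simply spans chunks. A tiny illustration (small data here; the comments describe the large-data behaviour):

```Python
import pyarrow as pa

# A Table column is a ChunkedArray, so string data larger than the int32
# offset limit of pa.string() can simply be split across chunks:
chunks = [pa.array(["x" * 1024] * 1024) for _ in range(4)]
table = pa.Table.from_arrays([pa.chunked_array(chunks)], names=["col"])

# Collapsing the column into one contiguous array (what a RecordBatch
# requires) works here, but raises an offset-overflow error once the column
# exceeds ~2 GiB of string data, unless the type is changed to
# large_string or string_view first:
batch = table.combine_chunks().to_batches()[0]
```

That chunked representation is exactly what a `Table` wrapper in `arrow-pyarrow` would map onto `Vec<RecordBatch>`.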