jonded94 commented on PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3495859644

   Thanks @kylebarron for your very quick review! :heart: 
   
   > Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.
   >
   > I don't know what the stance of maintainers is towards including a `Table` construct for python integration.
   
   Yes, I'm not entirely sure about it either; that's why I only sketched out a rough implementation without tests so far. One reason I think this could be a nice addition to `arrow-pyarrow` is that the [documentation](https://arrow.apache.org/rust/arrow_pyarrow/index.html) itself mentions that there is no equivalent to `pyarrow.Table` in `arrow-pyarrow`, and that slight workarounds are needed to use one:
   
   > PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't have these same concepts. A chunked table is instead represented with `Vec<RecordBatch>`. A `pyarrow.Table` can be imported to Rust by calling [pyarrow.Table.to_reader()](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_reader) and then importing the reader as a [ArrowArrayStreamReader].
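   The workaround from the quoted docs can be sketched on the Python side like this (a minimal sketch with toy data, assuming `pyarrow` is installed; the Rust half would consume the resulting stream as an `ArrowArrayStreamReader`):

   ```Python
   import pyarrow as pa

   # Build a small pyarrow.Table and expose it as a stream of RecordBatches,
   # which is the form arrow-rs can import today.
   table = pa.table({"x": [1, 2, 3]})
   reader = table.to_reader()  # pyarrow.RecordBatchReader

   # On the Rust side this reader would be imported via the C stream
   # interface; here we just drain it in Python to show the shape.
   batches = list(reader)
   total_rows = sum(b.num_rows for b in batches)
   ```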
   
   Personally, I think such a wrapper could be nice, since it simplifies things when you already have a `Vec<RecordBatch>` on the Rust side, or when you need to handle a `pyarrow.Table` on the Python side and want an easy way to generate one from Rust. The documentation could still state that streaming approaches are generally preferred, and that the `pyarrow.Table` convenience wrapper should only be used when users know what they're doing.
   
   ### Slightly nicer Python workflow
   
   In our very specific example, we have a Python class with a function such as 
this one:
   
   ```Python
   class ParquetFile:
     def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...
   ```
   
   In the [issue](https://github.com/apache/arrow-rs/issues/7973) I linked, this unfortunately breaks down for a specific Parquet file, since a particular row group isn't expressible as a single `RecordBatch` without changing types somewhere. Either you change the underlying Arrow types from `String` to `LargeString` or `StringView`, or you change the returned type from `pyarrow.RecordBatch` to, for example, `Iterator[pyarrow.RecordBatch]` (or `RecordBatchReader` or any other streaming-capable object).
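   The first alternative, changing the underlying types, can be illustrated with a cast on the Python side (a toy sketch; the column name and data are made up):

   ```Python
   import pyarrow as pa

   # A toy table with a regular (32-bit offset) string column.
   t = pa.table({"s": ["a", "b", "c"]})

   # Casting to large_string switches to 64-bit offsets, so a single
   # batch can hold more than 2 GiB of string data.
   t_large = t.cast(pa.schema([("s", pa.large_string())]))
   ```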
   
   The latter comes with some syntactic shortcomings in contexts where you want to call `.to_pylist()` on whatever `read_row_group(...)` returns:
   
   ```Python
   rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
   python_objs: list[dict[str, Any]]
   if isinstance(rg, pyarrow.RecordBatch):
     python_objs = rg.to_pylist()
   else:
     python_objs = list(
       itertools.chain.from_iterable(batch.to_pylist() for batch in rg)
     )
   ```
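   One way to hide that branching is a small helper that accepts either shape. This is a hypothetical sketch, duck-typed on `.to_pylist()` rather than on concrete `pyarrow` classes, so it works for anything batch-like:

   ```Python
   import itertools
   from typing import Any

   def to_pylist_any(rg) -> list[dict[str, Any]]:
       # A single batch-like object exposes .to_pylist() directly.
       if hasattr(rg, "to_pylist"):
           return rg.to_pylist()
       # Otherwise assume an iterable of batch-like objects and flatten.
       return list(itertools.chain.from_iterable(b.to_pylist() for b in rg))
   ```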
   
   With `pyarrow.Table`, this is already much simpler on the Python side:
   
   ```Python
   rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
   python_objs: list[dict[str, Any]] = rg.to_pylist()
   ```
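   On the producing side, a `pyarrow.Table` can be assembled from the same `Vec<RecordBatch>`-shaped data with `pyarrow.Table.from_batches`, which is roughly what a Rust-side `Table` wrapper would do across the FFI boundary (toy data for illustration):

   ```Python
   import pyarrow as pa

   # Two batches standing in for a row group that doesn't fit in one batch.
   batches = [
       pa.RecordBatch.from_pydict({"x": [1, 2]}),
       pa.RecordBatch.from_pydict({"x": [3]}),
   ]

   table = pa.Table.from_batches(batches)
   rows = table.to_pylist()  # one flat list across all chunks
   ```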
   
   And just for clarity: we unfortunately *need* the entire row group deserialized as Python objects, because the data ingestion pipelines that consume this expect access to the entire row group in bulk, so streaming approaches are sadly not usable.
   
   > FWIW if you wanted to look at external crates, [`PyTable` exists](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/struct.PyTable.html) that probably does what you want. (disclosure it's my project). That alternatively might give you ideas for how to handle the `Table` here if you still want to do that. (It's a separate crate for [these reasons](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration))
   
   Yes, in general I much prefer `arro3`'s approach of being completely `pyarrow`-agnostic. In our case, unfortunately, we're currently still pretty hardcoded against `pyarrow` specifics, and we just use `arrow-rs` as a means to reduce memory load compared to reading and writing Parquet datasets with `pyarrow` directly.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
