kylebarron commented on PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3497654509

   > one has to do slight workarounds to use them:
   
   I think that's outdated for Python -> Rust. I haven't tried it, but you 
should be able to pass a `pyarrow.Table` directly into an 
`ArrowArrayStreamReader` on the Rust side, because it just looks for the 
`__arrow_c_stream__` method, which exists on both `pyarrow.Table` and 
`pyarrow.RecordBatchReader`.
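   For what it's worth, here's a minimal (untested) sketch of that direction, 
assuming the arrow crate's `pyarrow` feature plus pyo3; the `count_rows` 
function name is just an illustration:

```rust
use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Accepts any Python object exposing `__arrow_c_stream__`,
/// e.g. a `pyarrow.Table` or a `pyarrow.RecordBatchReader`.
#[pyfunction]
fn count_rows(stream: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    // `PyArrowType` is a newtype wrapper; `.0` is the Rust-side stream reader.
    let reader = stream.0;
    let mut rows = 0;
    for batch in reader {
        let batch = batch.map_err(|e| PyValueError::new_err(e.to_string()))?;
        rows += batch.num_rows();
    }
    Ok(rows)
}
```

   On the Python side, `count_rows(table)` and `count_rows(table.to_reader())` 
should both work, since both objects expose `__arrow_c_stream__`.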
   
   But I assume there's no way today to easily return a `Table` from Rust to 
Python.
   
   > At least I personally think having such a wrapper could be nice, since it 
simplifies stuff a bit when you anyways already have `Vec<RecordBatch>` on the 
Rust side somewhere or need to handle a `pyarrow.Table` on the Python side and 
want to have an easy method to generate such a thing from Rust.
   
   I'm fine with that, and I think other maintainers would probably be fine 
with it too, since `Table` is a concept that exists only in the Python 
integration.
   
   I'm not sure I totally get your example. Returning a union of multiple 
types to Python seems bad, but returning a `Table` there seems reasonable. 
The alternative is to return a stream and let the user either iterate over it 
lazily or choose to materialize it with 
`pa.table(ParquetFile.read_row_group(...))`.
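   For concreteness, a minimal sketch of that stream-returning alternative 
using arrow-rs's existing `pyarrow` integration (the batch contents and the 
`read_row_group` name here are purely illustrative placeholders):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, RecordBatch, RecordBatchIterator, RecordBatchReader};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

/// Returns a stream of batches to Python as a `pyarrow.RecordBatchReader`.
#[pyfunction]
fn read_row_group() -> PyResult<PyArrowType<Box<dyn RecordBatchReader + Send>>> {
    // Stand-in for batches you already have on the Rust side,
    // e.g. a decoded Parquet row group.
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef],
    )
    .map_err(|e| PyValueError::new_err(e.to_string()))?;

    // Wrap the batches in a `RecordBatchReader` and hand it back as a C stream.
    let reader = RecordBatchIterator::new(vec![Ok(batch)], schema);
    Ok(PyArrowType(Box::new(reader) as Box<dyn RecordBatchReader + Send>))
}
```

   The caller on the Python side can then either iterate the returned reader 
batch by batch or just call `pa.table(read_row_group(...))` to materialize 
the whole thing.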
   
   
   > And just for clarity, we unfortunately _need_ to have the entire Row group 
deserialized as Python objects because our data ingestion pipelines that 
consume this are expecting to have access to the entire row group in bulk, so 
streaming approaches are sadly not usable.
   
   Well, there's nothing stopping you from materializing the stream by passing 
it to `pa.table()`. You don't have to consume it as a stream.
   
   > Yes, in general, I much prefer the approach of `arro3` to be totally 
`pyarrow` agnostic. In our case unfortunately, we're right now still pretty 
hardcoded against `pyarrow` specifics and just use `arrow-rs` as a means to 
reduce memory load compared to reading & writing parquet datasets with 
`pyarrow` directly.
   
   You can use `pyo3-arrow` with `pyarrow` too, but I'm not opposed to adding 
this functionality to arrow-rs as well.

