jonded94 commented on PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3498377722

   > I think that's outdated for Python -> Rust. I haven't tried but you should 
be able to pass a pyarrow.Table directly into an ArrowArrayStreamReader on the 
Rust side
   
   Yes, exactly, that's what I already mentioned in this PR (https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR491-R492):
   
   ```rust
   /// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
   /// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead)
   ```
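   
   For illustration, the consuming end on the Rust side could look roughly like the sketch below (assuming `pyo3` plus the `pyarrow` feature of the `arrow` crate; the function itself is made up for this example, only `PyArrowType` and `ArrowArrayStreamReader` come from `arrow-pyarrow`):
   
   ```rust
   use arrow::array::RecordBatch;
   use arrow::ffi_stream::ArrowArrayStreamReader;
   use arrow::pyarrow::PyArrowType;
   use pyo3::exceptions::PyValueError;
   use pyo3::prelude::*;
   
   /// Accepts anything exposing the Arrow C stream PyCapsule interface,
   /// so a `pyarrow.Table` can be passed in directly from Python.
   #[pyfunction]
   fn count_rows(reader: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
       let mut rows = 0;
       for batch in reader.0 {
           let batch: RecordBatch = batch.map_err(|e| PyValueError::new_err(e.to_string()))?;
           rows += batch.num_rows();
       }
       Ok(rows)
   }
   ```
   
   On the Python side, `count_rows(table)` then works without any explicit conversion, since recent `pyarrow` versions expose `__arrow_c_stream__` on `pyarrow.Table`.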
   
   As you said, it's the opposite direction, namely easily returning a `Vec<RecordBatch>` as a `pyarrow.Table` to Python, that is really missing here and what this PR is mainly about.
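   
   To make the gap concrete: without such a wrapper, the usual route today is to hand the batches back as a stream and let Python materialize it. A rough sketch under those assumptions (the helper name is mine, not part of this PR):
   
   ```rust
   use arrow::datatypes::SchemaRef;
   use arrow::error::ArrowError;
   use arrow::ffi_stream::{ArrowArrayStreamReader, FFI_ArrowArrayStream};
   use arrow::pyarrow::PyArrowType;
   use arrow::record_batch::{RecordBatch, RecordBatchIterator};
   
   /// Hypothetical helper: wrap already-materialized batches in a stream reader
   /// so that a `#[pyfunction]` can return them as a `pyarrow.RecordBatchReader`.
   fn batches_to_py_stream(
       schema: SchemaRef,
       batches: Vec<RecordBatch>,
   ) -> Result<PyArrowType<ArrowArrayStreamReader>, ArrowError> {
       let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema);
       let stream = FFI_ArrowArrayStream::new(Box::new(reader));
       Ok(PyArrowType(ArrowArrayStreamReader::try_new(stream)?))
   }
   ```
   
   The Python caller then still has to run `pa.table(...)` (or `.read_all()`) on the returned reader to get a `pyarrow.Table`, which is exactly the extra step a `Vec<RecordBatch>` -> `pyarrow.Table` convenience would remove.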
   
   > I'm not sure I totally get your example. Seems bad to be returning a union 
of multiple types to Python. 
   
   My example wasn't entirely complete, for simplicity's sake (and still isn't); it would look more like this:
   
   ```Python
   from typing import Literal, overload
   
   import pyarrow
   
   class ParquetFile:
     @overload
     def read_row_group(self, index: int, as_table: Literal[True]) -> pyarrow.Table: ...
     @overload
     def read_row_group(self, index: int, as_table: Literal[False] = ...) -> pyarrow.RecordBatch: ...
     def read_row_group(self, index: int, as_table: bool = False) -> pyarrow.RecordBatch | pyarrow.Table: ...
   ```
   
   The advantage of that would be that both `pyarrow.RecordBatch` and `pyarrow.Table` implement `.to_pylist() -> list[dict[str, Any]]`. This is the important bit here, as we later just want to be able to call `to_pylist()` on whatever singular object `read_row_group(...)` returns and be guaranteed that the entire row group is deserialized as Python objects in this list. So it could also be expressed, in our very specific example, as:
   
   ```Python
   from typing import Any, Protocol
   
   class ToListCapable(Protocol):
     def to_pylist(self) -> list[dict[str, Any]]: ...
   
   class ParquetFile:
     def read_row_group(self, index: int, as_table: bool = False) -> ToListCapable: ...
   ```
   
   > The alternative is to return a stream and have the user either iterate 
over it lazily or choose to materialize it with 
pa.table(ParquetFile.read_row_group(...)).
   
   &
   
   > Well there's nothing stopping you from materializing the stream by passing 
it to pa.table(). You don't have to use the stream as a stream.
   
   Yes, sure! We also do that in other places, or have entirely streamable pipelines elsewhere that use the PyCapsule ArrowStream interface. It's just that for this very specific use case, a `Vec<RecordBatch>` -> `pyarrow.Table` convenience wrapper maps perfectly to what we need, with no changes required in any consuming code, and, as I said, I would be interested in whether the maintainers of `arrow-pyarrow` find that useful for such niche use cases.

