jonded94 commented on PR #8790: URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3498377722
> I think that's outdated for Python -> Rust. I haven't tried but you should be able to pass a pyarrow.Table directly into an ArrowArrayStreamReader on the Rust side

Yes, exactly, that's what I even mentioned in this PR (https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR491-R492):

```
/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead)
```

As you said, the opposite direction, namely easily returning a `Vec<RecordBatch>` as a `pyarrow.Table` to Python, is what's really missing here and what this PR is mainly about.

> I'm not sure I totally get your example. Seems bad to be returning a union of multiple types to Python.

My example wasn't entirely complete, for simplicity (and still isn't); it would be more something like this:

```Python
class ParquetFile:
    @overload
    def read_row_group(self, index: int, as_table: Literal[True]) -> pyarrow.Table: ...
    @overload
    def read_row_group(self, index: int, as_table: Literal[False]) -> pyarrow.RecordBatch: ...
    def read_row_group(self, index: int, as_table: bool = False) -> pyarrow.RecordBatch | pyarrow.Table: ...
```

The advantage of that would be that both `pyarrow.RecordBatch` and `pyarrow.Table` implement `.to_pylist() -> list[dict[str, Any]]`. This is the important bit here, as we later just want to be able to call `to_pylist()` on whatever singular object `read_row_group(...)` returns and be guaranteed that the entire row group is deserialized as Python objects in this list. So in our very specific example it could also be expressed as:

```Python
class ToListCapable(Protocol):
    def to_pylist(self) -> list[dict[str, Any]]: ...

class ParquetFile:
    def read_row_group(self, index: int, as_table: bool = False) -> ToListCapable: ...
```

> The alternative is to return a stream and have the user either iterate over it lazily or choose to materialize it with pa.table(ParquetFile.read_row_group(...)).

&

> Well there's nothing stopping you from materializing the stream by passing it to pa.table(). You don't have to use the stream as a stream.

Yes, sure! We also do that in other places, and have entirely streamable pipelines elsewhere that use the PyCapsule ArrowStream interface. It's just that for this very specific use case, a `Vec<RecordBatch>` -> `pyarrow.Table` convenience wrapper maps perfectly onto what we need, with no required changes in any consuming code, and, as I said, I would be interested in whether the maintainers of `arrow-pyarrow` find that useful for such niche use cases.
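For concreteness, here is a self-contained, runnable sketch of the `@overload` + `Protocol` pattern from the two snippets above, using only the stdlib `typing` module. `FakeRecordBatch`, `FakeTable`, and the row-group layout are hypothetical stand-ins for the real pyarrow types, not the actual pyarrow API; the point is only that both return types satisfy `ToListCapable` structurally, so callers can rely on `to_pylist()` regardless of `as_table`:

```Python
from typing import Any, Literal, Protocol, overload


# Hypothetical stand-ins for pyarrow.RecordBatch / pyarrow.Table: both
# expose to_pylist(), which is the only capability the caller relies on.
class FakeRecordBatch:
    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    def to_pylist(self) -> list[dict[str, Any]]:
        return self._rows


class FakeTable:
    def __init__(self, batches: list[FakeRecordBatch]) -> None:
        self._batches = batches

    def to_pylist(self) -> list[dict[str, Any]]:
        # A table concatenates the rows of all of its batches.
        return [row for batch in self._batches for row in batch.to_pylist()]


# Structural type: anything with to_pylist() qualifies, no inheritance needed.
class ToListCapable(Protocol):
    def to_pylist(self) -> list[dict[str, Any]]: ...


class ParquetFile:
    def __init__(self, row_groups: list[list[FakeRecordBatch]]) -> None:
        self._row_groups = row_groups

    @overload
    def read_row_group(self, index: int, as_table: Literal[True]) -> FakeTable: ...
    @overload
    def read_row_group(self, index: int, as_table: Literal[False] = ...) -> FakeRecordBatch: ...

    def read_row_group(self, index: int, as_table: bool = False) -> ToListCapable:
        batches = self._row_groups[index]
        if as_table:
            return FakeTable(batches)
        # For simplicity, pretend each row group holds exactly one batch here.
        return batches[0]


pf = ParquetFile([[FakeRecordBatch([{"x": 1}, {"x": 2}])]])
print(pf.read_row_group(0, as_table=True).to_pylist())  # [{'x': 1}, {'x': 2}]
print(pf.read_row_group(0).to_pylist())                 # [{'x': 1}, {'x': 2}]
```

A type checker narrows the return type from the `Literal[True]`/`Literal[False]` argument, while code that only needs `to_pylist()` can type against `ToListCapable` and ignore the distinction entirely.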
