jonded94 commented on PR #8790: URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3536513427
Hey, I pushed a version that does not use the PyCapsule ArrayStream interface for converting a `pyarrow.Table` to `Table` when the given Python object has a `to_batches()` method (which `pyarrow.Table` does). This is not necessarily intended to stay that way, but it is helpful for diagnosing where RecordBatch metadata is dropped.

`pyarrow.Table.to_batches()` returns a `list[pyarrow.RecordBatch]`, which I explicitly convert to a `Vec<RecordBatch>` in the `from_pyarrow_bound` function of `impl FromPyArrow for Table` (see the first sketch below). This is essentially the mirror of what I already do in the corresponding `impl IntoPyArrow for Table`: there I also skip the PyCapsule interface and construct a `pyarrow.Table` directly from a `Vec<RecordBatch>` via `pyarrow.Table.from_batches(...)` (second sketch at the end).

With that, I get RecordBatches from a `pyarrow.Table` with their metadata preserved, which in turn lets me drop the `schema_equals` function and do a full `schema == record_batch.schema()` check instead.

I also verified on the Python side, using `pyarrow.RecordBatchReader.from_stream` on a `StreamWrapper` around a `pyarrow.Table`, that RecordBatches coming out of a `pyarrow.Table`'s ArrayStream PyCapsule interface definitely still carry their metadata. So the error has to be on the Rust side, somewhere in the `Box<dyn RecordBatchReader>` / `impl FromPyArrow for ArrowArrayStreamReader` path. There may be a slight misuse of the PyCapsule interface somewhere, since that path definitely returns RecordBatches without metadata. I'm not too familiar with the low-level details there, but I'll try to investigate; help is appreciated!
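To make the diagnostic path concrete, here is a minimal sketch of the `to_batches()`-based import direction, assuming recent pyo3 and arrow-rs (`pyarrow` feature) APIs. `batches_from_pyarrow_table` is a hypothetical helper name standing in for the body of `from_pyarrow_bound` in `impl FromPyArrow for Table`:

```rust
use arrow::pyarrow::FromPyArrow;
use arrow::record_batch::RecordBatch;
use pyo3::prelude::*;
use pyo3::types::PyList;

// Hypothetical helper: sketch of the diagnostic conversion, not the PR's
// actual code. Each pyarrow.RecordBatch is converted individually instead
// of exporting the Table as an ArrayStream PyCapsule.
fn batches_from_pyarrow_table(ob: &Bound<'_, PyAny>) -> PyResult<Vec<RecordBatch>> {
    // pyarrow.Table.to_batches() returns list[pyarrow.RecordBatch]
    let py_batches = ob.call_method0("to_batches")?;
    let py_batches = py_batches.downcast::<PyList>()?;
    py_batches
        .iter()
        .map(|b| RecordBatch::from_pyarrow_bound(&b))
        .collect()
}
```

Because each batch is converted on its own, the stream export is never involved; any metadata that survives here but not through `ArrowArrayStreamReader` points at the PyCapsule stream path.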

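For symmetry, a matching sketch of the export direction, again with a hypothetical helper name (`pyarrow_table_from_batches`) and assuming pyo3 ≥ 0.23 for `Python::import`; in the PR this logic lives in `impl IntoPyArrow for Table`:

```rust
use arrow::pyarrow::IntoPyArrow;
use arrow::record_batch::RecordBatch;
use pyo3::prelude::*;

// Hypothetical helper: convert each RecordBatch to a pyarrow.RecordBatch,
// then build the table with pyarrow.Table.from_batches(...) instead of
// exporting an ArrayStream PyCapsule.
fn pyarrow_table_from_batches(
    py: Python<'_>,
    batches: Vec<RecordBatch>,
) -> PyResult<Py<PyAny>> {
    let py_batches = batches
        .into_iter()
        .map(|b| b.into_pyarrow(py))
        .collect::<PyResult<Vec<_>>>()?;
    let table_cls = py.import("pyarrow")?.getattr("Table")?;
    Ok(table_cls
        .call_method1("from_batches", (py_batches,))?
        .unbind())
}
```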