jonded94 commented on PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#issuecomment-3504851078

   Hey @kylebarron, thanks for the review! I implemented everything as you 
suggested.
   
   As you can see, the CI is now broken because of a subtle problem that was uncovered. Maybe you can help me here, as I'm not too familiar with the FFI logic and none of my sanity checks got me anywhere:
   
   In the `test_table_roundtrip` Python test in 
`arrow-pyarrow-integration-test`, we're simply handing a `pyarrow.Table` to 
Rust and letting it roundtrip through the conversion layers back to a 
`pyarrow.Table`. Unfortunately, the conversion from `PyArrowType<Table>` -> 
`ArrowArrayStreamReader` -> `Table` is now failing, specifically the last part.
   
   The `PyArrowType<Table>` -> `ArrowArrayStreamReader` conversion works, but as soon as `RecordBatch`es are read from the stream reader in the `try_new` [function](https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR514) of `Table`, they lose their metadata. This then fails because `try_new` validates that the schema of every record batch matches the explicitly given schema.
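
   For illustration, here's a std-only sketch (hypothetical stand-in structs, not the real `arrow_schema` types) of why the check trips: with a derived `PartialEq`, two schemas whose fields are identical but whose metadata differs compare unequal:

```rust
use std::collections::HashMap;

// Simplified stand-ins for arrow's Schema/Field; the real types live in
// arrow_schema. The point is that equality covers metadata too.
#[derive(Debug, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

#[derive(Debug, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: HashMap<String, String>,
}

fn main() {
    let field = |n: &str| Field { name: n.to_string(), nullable: true };

    // Schema reported by the stream reader: carries the metadata.
    let expected = Schema {
        fields: vec![field("ints")],
        metadata: HashMap::from([("key1".to_string(), "value1".to_string())]),
    };

    // Schema attached to the batches read over FFI: metadata lost.
    let got = Schema {
        fields: vec![field("ints")],
        metadata: HashMap::new(),
    };

    // The fields agree, but the schemas as a whole do not.
    assert_eq!(expected.fields, got.fields);
    assert_ne!(expected, got);
    println!("fields equal, schemas unequal");
}
```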
   
   The schema obtained from the `ArrowArrayStreamReader` still has the metadata `{"key1": "value1"}` attached, but the individual `RecordBatch`es do not. I left a somewhat verbose error message in the Rust error:
   
   ```
   ValueError: Schema error: All record batches must have the same schema.
   Expected schema: Schema { fields: [Field { name: "ints", data_type: List(Field { data_type: Int32, nullable: true }), nullable: true }], metadata: {"key1": "value1"} },
   got schema: Schema { fields: [Field { name: "ints", data_type: List(Field { data_type: Int32, nullable: true }), nullable: true }], metadata: {} }
   ```
   
   This previously worked because I was building the `Table` through an `unsafe` interface that didn't validate the schemas.
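
   Assuming the FFI path really does drop batch-level metadata, one conceivable workaround would be to reattach the stream schema's metadata to each batch before validating (in real arrow-rs this might go through `RecordBatch::with_schema`, though I haven't verified it accepts a metadata-superset schema here). A std-only sketch of that normalization, again with hypothetical stand-in types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for arrow's types; the real workaround would
// operate on RecordBatch / SchemaRef.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: HashMap<String, String>,
}

// If the fields agree, adopt the stream schema's metadata for the batch
// schema; otherwise report the genuine mismatch. (Sketch only.)
fn reattach_metadata(expected: &Schema, got: Schema) -> Result<Schema, String> {
    if got.fields == expected.fields {
        Ok(Schema {
            fields: got.fields,
            metadata: expected.metadata.clone(),
        })
    } else {
        Err(format!("field mismatch: {:?} vs {:?}", expected.fields, got.fields))
    }
}

fn main() {
    let expected = Schema {
        fields: vec![Field { name: "ints".to_string(), nullable: true }],
        metadata: HashMap::from([("key1".to_string(), "value1".to_string())]),
    };
    // Batch schema as read over FFI: same fields, metadata gone.
    let got = Schema {
        fields: expected.fields.clone(),
        metadata: HashMap::new(),
    };

    let fixed = reattach_metadata(&expected, got).unwrap();
    assert_eq!(fixed, expected); // now passes the strict equality check
    println!("metadata reattached");
}
```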
   
   Sanity checks:
   - All other roundtrips work without problems; the metadata seems to be handed through all other layers correctly
   - In the failing Python test `test_table_roundtrip`, I asserted that the `pyarrow.Table` definitely still *has* the metadata attached, and so do all `RecordBatch`es obtained from it
     - Only in the conversion to Rust `RecordBatch`es through the `Box<dyn ArrowArrayStreamReader>` do they somehow seem to lose their metadata. Is this something the FFI interface doesn't guarantee?

