Re: [PR] Implement a `Vec` wrapper for `pyarrow.Table` convenience [arrow-rs]

via GitHub Sun, 16 Nov 2025 13:08:00 -0800


jonded94 commented on code in PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#discussion_r2532232970



##########
arrow-array/src/ffi_stream.rs:
##########
@@ -364,7 +364,13 @@ impl Iterator for ArrowArrayStreamReader {
             let result = unsafe {
                 from_ffi_and_data_type(array, 
DataType::Struct(self.schema().fields().clone()))
             };
-            Some(result.map(|data| RecordBatch::from(StructArray::from(data))))
+            Some(result.map(|data| {
+                let struct_array = StructArray::from(data);

Review Comment:
   > The rationale for this form rather than just converting StructArray to 
RecordBatch
   
   Basically what I explained here 
(https://github.com/apache/arrow-rs/pull/8790#issuecomment-3536535648) => 
`StructArray` alone by definition is metadata-less, in turn leading to the 
problem that the resulting `RecordBatch` won't have any metadata attached if 
you just return it as-is.
   
   I'm not sure whether there is another more elegant way to construct a 
`RecordBatch` with corresponding metadata from `ArrayData`. Right now I'm going 
through `StructArray` because the previous interface did that too. If there is 
another more elegant way, please let me know.
   
   Other ways to attach metadata to an existing `RecordBatch` would be, as far 
as I can see, to call `with_schema()` (which will incure some "is subschema 
test" costs) or somehow through `schema_metadata_mut()`, but the interface 
feels a bit clunky for this specific task IMHO.
   
   > A SAFETY comment explaining why the unsafe is ok (aka how the invariants 
required for RecordBatch::new_unchecked are satisfied)
   
   One reason for the `unsafe` here is that I did not want to introduce 
performance penalties in comparison to what the interface did before (it just 
returned `RecordBatch` without checking whether it's actually corresponding to 
the schema of `ArrowArrayStreamReader`; and the schemas actually mismatched 
before my change, at least metadata-wise). 
   
   In principle `Iterator` of `ArrowArrayStreamReader` returns `Result`, so we 
can make this fallible through `RecordBatch::try_new(...)`. This would incur 
some costs though, such as checking each column for correct nullability, equal 
and correct row count, type checks, etc..
   
   I would have guessed that at least data-wise the interface can be trusted 
and therefore the checks can be omitted? :sweat_smile: I'm really not the 
expert here, I would have assumed that someone from the `arrow-rs` team could 
have some opinion here :grimacing: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Implement a `Vec` wrapper for `pyarrow.Table` convenience [arrow-rs]

Reply via email to