jonded94 commented on code in PR #8790:
URL: https://github.com/apache/arrow-rs/pull/8790#discussion_r2532232970
##########
arrow-array/src/ffi_stream.rs:
##########
@@ -364,7 +364,13 @@ impl Iterator for ArrowArrayStreamReader {
let result = unsafe {
from_ffi_and_data_type(array,
DataType::Struct(self.schema().fields().clone()))
};
- Some(result.map(|data| RecordBatch::from(StructArray::from(data))))
+ Some(result.map(|data| {
+ let struct_array = StructArray::from(data);
Review Comment:
> The rationale for this form rather than just converting StructArray to
RecordBatch
Basically what I explained here
(https://github.com/apache/arrow-rs/pull/8790#issuecomment-3536535648) =>
`StructArray` alone by definition is metadata-less, in turn leading to the
problem that the resulting `RecordBatch` won't have any metadata attached if
you just return it as-is.
I'm not sure whether there is another more elegant way to construct a
`RecordBatch` with corresponding metadata from `ArrayData`. Right now I'm going
through `StructArray` because the previous interface did that too. If there is
another more elegant way, please let me know.
Other ways to attach metadata to an existing `RecordBatch` would be, as far
as I can see, to call `with_schema()` (which will incure some "is subschema
test" costs) or somehow through `schema_metadata_mut()`, but the interface
feels a bit clunky for this specific task IMHO.
> A SAFETY comment explaining why the unsafe is ok (aka how the invariants
required for RecordBatch::new_unchecked are satisfied)
One reason for the `unsafe` here is that I did not want to introduce
performance penalties in comparison to what the interface did before (it just
returned `RecordBatch` without checking whether it's actually corresponding to
the schema of `ArrowArrayStreamReader`; and the schemas actually mismatched
before my change, at least metadata-wise).
In principle `Iterator` of `ArrowArrayStreamReader` returns `Result`, so we
can make this fallible through `RecordBatch::try_new(...)`. This would incur
some costs though, such as checking each column for correct nullability, equal
and correct row count, type checks, etc..
I would have guessed that at least data-wise the interface can be trusted
and therefore the checks can be omitted? :sweat_smile: I'm really not the
expert here, I would have assumed that someone from the `arrow-rs` team could
have some opinion here :grimacing:
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]