dataders opened a new pull request, #9951: URL: https://github.com/apache/arrow-rs/pull/9951
# Which issue does this PR close? - Contributes to #7136 (See also the prior attempt #7137, which proposed the always-on form of this fix and was closed after maintainer feedback asking for an opt-in API. This PR implements exactly the [builder shape @alamb sketched in that thread](https://github.com/apache/arrow-rs/pull/7137#issuecomment-2706290700).) # Rationale for this change Some Arrow C Data Interface producers — notably the Go ADBC drivers in wide use today (e.g. the Snowflake driver) — emit data buffers whose pointers are not aligned to the underlying primitive type's alignment. When such buffers reach `arrow-rs`, downstream typed access in [`ScalarBuffer::from`](https://github.com/apache/arrow-rs/blob/main/arrow-buffer/src/buffer/scalar.rs#L184-L198) panics with `"Memory pointer from external source (e.g, FFI) is not aligned with the specified scalar type"`, taking the consumer process down. We hit this in production reading real Snowflake `VARCHAR` columns through the Go ADBC Snowflake driver into a Rust consumer, and we are not the only ones — see e.g. [apache/arrow-adbc#2526](https://github.com/apache/arrow-adbc/issues/2526) and the linked downstream reports. The previous attempt #7137 made `ArrowArrayStreamReader` unconditionally call `align_buffers()` on every imported batch. The discussion there (with @tustvold and @alamb) settled on: - silent reallocation hides producer bugs and changes behavior for consumers who would prefer a hard error; - but for consumers who **know** they're talking to a producer that emits unaligned data (e.g. a specific ADBC driver), they need a way to ask `arrow-rs` to fix it up rather than panicking deep in their kernels. @alamb's sketch: ```rust let stream_reader = ArrowArrayStreamReader::try_new(ffi_stream) .with_align_buffers(true); ``` This PR implements that. # What changes are included in this PR? - Add a private `align_buffers: bool` field to `ArrowArrayStreamReader`, defaulting to `false` so the existing behavior is preserved bit-for-bit. - Add a public builder method `ArrowArrayStreamReader::with_align_buffers(self, bool) -> Self`. - In `Iterator::next`, when the flag is set, call `ArrayData::align_buffers()` on the freshly imported `ArrayData` before constructing the `RecordBatch`. `align_buffers` only copies buffers that are actually misaligned, so adequately aligned producers pay no cost beyond the per-buffer alignment check. The existing PR #7137's diff was 7 lines and changed default behavior. This PR is opt-in only. # Are these changes tested? Yes. Two new tests in `arrow-array/src/ffi_stream.rs`: - `test_unaligned_buffers_default_panics_on_typed_access` — fabricates an `FFI_ArrowArrayStream` whose first batch has a deliberately misaligned `Int32` data buffer (a one-byte-offset slice into an aligned backing buffer). Asserts via `#[should_panic]` that the default reader still panics on typed access, preserving the existing invariant that unaligned producers surface as detectable failures. - `test_unaligned_buffers_with_align_buffers_opt_in` — same input, reader configured with `with_align_buffers(true)`. Asserts the imported buffer's pointer is aligned to `align_of::<i32>()`, that typed access (`.values()`) does not panic, and that the stream is exhausted after the single batch. Both tests bypass the standard `FFI_ArrowArrayStream::new(reader)` helper because that helper materializes the source `RecordBatch` into typed arrays, which itself panics on misaligned input. Instead they build a minimal custom `FFI_ArrowArrayStream` whose `get_next` callback emits an `FFI_ArrowArray` constructed from an untyped `ArrayData` — closer to what a real foreign producer would do. \`\`\` $ cargo test -p arrow-array --features ffi test result: ok. 948 passed; 3 ignored; 0 failed $ cargo clippy -p arrow-array --all-features --tests (clean) $ cargo fmt -p arrow-array --check (clean) \`\`\` # Are there any user-facing changes? Yes, but **additive only**: one new public method on `ArrowArrayStreamReader`. No existing API or default behavior changes. No breaking changes. # Notes for reviewers - Marking as **draft** because the requesting team wants to eyeball it before un-drafting; happy to take any API-shape feedback (e.g. naming the method differently, defaulting differently, adding the same opt-in to other FFI import paths) before that. - The signature `with_align_buffers(mut self, bool) -> Self` mirrors the existing `ArrayDataBuilder::align_buffers(mut self, bool) -> Self` ([data.rs:2222](https://github.com/apache/arrow-rs/blob/main/arrow-data/src/data.rs#L2222)) for consistency. If you'd prefer a parameterless `with_align_buffers(self) -> Self`, easy change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
