atwam commented on issue #9716:
URL: https://github.com/apache/arrow-rs/issues/9716#issuecomment-4273961244
I think the important distinction is between what C++ accepts and what C++
normally produces.
My reading is:
- The spec still describes variable-width offsets as having length `length +
1`, so an empty offsets buffer for a zero-length Binary/List array is not the
canonical layout and does not appear to be explicitly allowed by the spec.
- C++ is deliberately permissive on the read/validation side for `length ==
0` arrays. That is not accidental, and was added intentionally in the PR you
reference
- However, the normal C++ builder path is still canonical. `BinaryBuilder` /
`ListBuilder` start with no offsets and append the trailing `0` during
`Finish`, so **builder-produced empty arrays end up with an offsets buffer of
size 1 containing `0`**.
- The C++ IPC writer does not normalize zero-length arrays into canonical
form. It preserves whatever in-memory representation the array already has. So
if someone directly constructs a zero-length array with empty/null offsets, C++
IPC can write such a file too.
- So “follow C++” can mean two different things:
1. follow C++’s permissive reader/validator behavior and accept empty
offsets on input, or
2. follow C++’s normal builder-produced output and write canonical `[0]`
offsets.
However, I think there is a stronger argument than just “this is
non-canonical”.
On the C++ side, IPC read preserves a zero-byte offsets buffer for length-0
binary/list arrays, and validation explicitly allows that representation for
compatibility. However, several downstream APIs still assume canonical `length
+ 1` offsets and will dereference the missing sentinel.
Concrete examples:
- `ListArray::Flatten()` reads `value_offset(0)` and `value_offset(length)`
even for an empty list array.
- `list_parent_indices` reads `offsets[length] - offsets[0]` for list arrays.
- CSV writing for string arrays calls `value_offset(0)` without a length-0
guard.
- The integration JSON writer exports `length + 1` offsets unconditionally.
- `ListArray::offsets()` can box the empty underlying offsets buffer as an
offsets array of length 1, which is internally inconsistent.
**Reproduction confirms out-of-bounds read in C++ when reading Rust-written
IPC files:** A minimal Rust program writing an empty `ListArray` via
`FileWriter` (on current `main`) produces an IPC file with a zero-byte offsets
buffer. When read back through PyArrow (C++ underneath), `ListArray::offsets()`
returns a 1-element `Int32Array` containing garbage (e.g. `-189153672`) — this
is an uninitialized/out-of-bounds memory read, because C++ `BoxOffsets()`
unconditionally constructs an array of `length + 1` entries from whatever
buffer is present, even when that buffer is empty.
So while C++ is permissive enough to *accept* these arrays, and many of its
operations will explicitely check for `length == 0` (bypassing any reading of
the offsets buffer altogether), that does not mean the representation is safe
throughout the library. There are downstream assumptions that a sentinel offset
exists. That strengthens the case for writing canonical `[0]` offsets in
arrow-rs even if some readers currently tolerate the empty-offset form.
Given that, I think the safest behavior for arrow-rs is:
- write the canonical `[0]` offsets buffer for zero-length variable-width
arrays, and
- continue to read/accept the permissive empty-offset form for compatibility.
If we want to make empty offsets formally spec-valid rather than just
tolerated by some implementations, then I agree that should probably be
discussed on the Arrow mailing list first.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]