brunal opened a new pull request, #46408:
URL: https://github.com/apache/arrow/pull/46408
### Rationale for this change
Arrow C++ slices arrays by bumping the top-level `offset` value.
However, Arrow Rust slices list arrays by slicing the `value_offsets`
buffer. When receiving a Rust Arrow Array in C++ (via the C data
interface), its IPC serialization fails to notice that the
`value_offsets` buffer needed to be updated, but it still updates the
`values` buffer. This leads to a corrupt array on deserialization, with
an `value_offsets` buffer that points past the end of the values array.
This PR fixes the IPC serialization by looking at value_offset(0) to
determine whether the `value_offsets` buffer needs reconstructing,
instead of just looking at offset(). This works because
value_offset(int) is value_offset buffer + top-level offset (through
ArrayData::GetValuesSafe(int)).
Additionally, I update the comment surrounding the logic, as
it had 2 issues:
1. It hints that offset > 0 and value_offsets[0] > 0 happen together,
when they actually tend to be exclusive (... unless you slice twice,
once in Rust and once in C++).
2. It mentions slicing the values, when that does not happen in the
function where the comment appears (GetZeroBasedValueOffsets), but at
call site (Visit(Array)).
### What changes are included in this PR?
The fix, as well as a small cleanup of the function:
* Extend the loop to cover i == array.length() instead of a dedicated
statement.
* Avoid copying the `offsets_buffer` shared_ptr (avoid refcount ++ then
--).
Note: I am not sure that the logic covered by the
[ARROW-6046](https://issues.apache.org/jira/browse/ARROW-6046) tag does
anything. I don't think it should trigger lest the offsets buffer is
invalid, and removing it does not cause any test to fail.
### Are these changes tested?
Yep
### Are there any user-facing changes?
Nope (well, unless they are affected by the bug)
**This PR contains a "Critical Fix".** (the changes fix (a) a security
vulnerability, (b) a bug that caused incorrect or invalid data to be produced)
: valid operations on valid data produce invalid data.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]