masumi-ryugo opened a new pull request, #9886:
URL: https://github.com/apache/arrow-rs/pull/9886
Closes #9885.
## What
`RecordDecoder::flush` walks the per-row offsets emitted by
`csv_core::Reader` and accumulates them so that, after the loop, each end
offset is absolute over `self.data`. The accumulator was a plain `usize` and
the loop body did `*x += offset`. On malformed input that drives `csv_core`
to emit row-relative offsets large enough to wrap a `usize`, this addition:
- panics with `attempt to add with overflow` in debug builds (the
cargo-fuzz `csv_reader` harness that found this is built with
`--debug-assertions`);
- silently wraps to a wildly out-of-bounds index in release builds, which
then trips an unrelated `assert!` or `unwrap` somewhere downstream.
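For context, a standalone sketch (not the arrow-rs code): Rust's `+` on `usize` panics under debug assertions but wraps silently in release, which is exactly the two failure modes above, whereas `checked_add` turns the overflow into an observable value:

```rust
fn main() {
    let base: usize = usize::MAX - 1;
    // `wrapping_add` models what release-mode `+` does: silently wrap around.
    assert_eq!(base.wrapping_add(3), 1);
    // `checked_add` surfaces the same overflow as `None`.
    assert_eq!(base.checked_add(3), None);
    // Additions that fit still succeed.
    assert_eq!(base.checked_add(1), Some(usize::MAX));
    println!("ok");
}
```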
## Fix
Switch the accumulation to `checked_add` and surface the overflow as an
`ArrowError::CsvError` instead. The loop body becomes a plain `for` loop
because `?` doesn't compose with the previous closure form.
```rust
let mut row_offset: usize = 0;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        *x = x.checked_add(offset).ok_or_else(|| {
            ArrowError::CsvError(
                "CSV record offsets overflowed usize while flushing".to_string(),
            )
        })?;
        row_offset = *x;
    }
}
```
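The same pattern in a self-contained form (the type and function names here are illustrative stand-ins, not the arrow-rs internals), showing both the cross-row accumulation and the error path:

```rust
// Hypothetical stand-in for ArrowError::CsvError.
#[derive(Debug)]
struct CsvError(String);

/// Turn row-relative end offsets into absolute ones, refusing to overflow.
fn make_absolute(offsets: &mut [usize], num_columns: usize) -> Result<(), CsvError> {
    let mut row_offset: usize = 0;
    for row in offsets.chunks_exact_mut(num_columns) {
        // Each row's offsets are shifted by where the previous row ended.
        let offset = row_offset;
        for x in row.iter_mut() {
            *x = x
                .checked_add(offset)
                .ok_or_else(|| CsvError("offset overflowed usize".to_string()))?;
            row_offset = *x;
        }
    }
    Ok(())
}

fn main() {
    // Two rows of two columns, offsets relative to each row's start.
    let mut offsets = vec![1, 3, 2, 5];
    make_absolute(&mut offsets, 2).unwrap();
    assert_eq!(offsets, vec![1, 3, 5, 8]);

    // A huge carried-over offset is reported as an error, not a panic.
    let mut bad = vec![usize::MAX, 1];
    assert!(make_absolute(&mut bad, 1).is_err());
    println!("ok");
}
```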
## Repro
The cargo-fuzz `csv_reader` harness from
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
(per #5332) reproduces this from an empty corpus in single-digit minutes. The
minimized repro is 72 bytes:
```
0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1|
0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.|
0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....|
0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.|
0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i|
```
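To materialize the repro locally, the hex dump above can be rebuilt with `xxd` (the `cargo fuzz run` invocation assumes the harness name from the linked branch):

```shell
# Rebuild the minimized 72-byte input from the hex dump above (plain-hex form).
xxd -r -p > repro.bin <<'EOF'
2e223f0a310a3f3f0a3c50500a3f0a31
0a3f380a3c0a3f0a3c50500a3f0a310a
3f380a0a2e223f0a310a3f3f0acecece
b1cececececececece310a3f380a3c0a
3f0a3c0a3f0a3f69
EOF
wc -c repro.bin   # expect 72
# Then feed it to the fuzz target, e.g.: cargo fuzz run csv_reader repro.bin
```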
Before this PR (run on `main` HEAD against the cargo-fuzz harness):
```
thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
```
After this PR, the same 72 bytes pass through the fuzz target in 40 ms with
exit code 0; the API now returns an `ArrowError::CsvError(...)` for callers to
handle.
## Tests
Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`,
which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush`
and asserts the loop terminates cleanly instead of panicking. The existing 4
tests in that module continue to pass.
## Alternatives considered
- **Cap by `self.data_len`**: each emitted offset is supposed to be ≤
`self.data_len`, so an explicit cap would also turn the overflow into a clean
error. I went with `checked_add` because it is the more targeted change: it
doesn't add a new invariant on `csv_core`'s output, it only refuses to compute
something that would have been arithmetically nonsensical anyway.
- **Use `saturating_add`**: would silently truncate the offset and then
mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data`
error or panic deeper in the call stack. Worse signal.
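A toy illustration of the `saturating_add` failure mode (a standalone sketch, not arrow-rs code): the clamped offset is still handed downstream, so the failure surfaces far from the arithmetic that caused it:

```rust
fn main() {
    let data = b"abcdef";
    // Saturation swallows the overflow and hands back usize::MAX...
    let end = usize::MAX.saturating_add(1);
    // ...so the problem only shows up later, as an out-of-range slice.
    assert!(data.get(2..end).is_none());
    // `checked_add` instead reports the problem at the arithmetic itself.
    assert_eq!(usize::MAX.checked_add(1), None);
    println!("ok");
}
```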
xref: #5332, #9883, #9884
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]