This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new ded985c95e fix(arrow-csv): bound RecordDecoder::flush offset
accumulation (#9886)
ded985c95e is described below
commit ded985c95e6d132567710319d21e1901973ea16f
Author: masumi-ryugo <[email protected]>
AuthorDate: Thu May 7 00:06:53 2026 +0900
fix(arrow-csv): bound RecordDecoder::flush offset accumulation (#9886)
Closes #9885.
## What
`RecordDecoder::flush` walks the per-row offsets emitted by
`csv_core::Reader` and accumulates them so each end offset is absolute
over `self.data` after the loop. The accumulator was a plain `usize` and
the loop body did `*x += offset`, which on malformed input that drives
`csv_core` to emit row-relative offsets large enough to wrap a `usize`:
- panics with `attempt to add with overflow` in debug builds (and the
cargo-fuzz `csv_reader` harness that found this is built with
`--debug-assertions`);
- silently wraps to a wildly out-of-bounds index in release builds,
which then trips an unrelated `assert!` / `unwrap` somewhere downstream.
## Fix
Switch the accumulator to `checked_add` and surface the overflow as
`ArrowError::CsvError` instead. The body of the loop becomes a normal
`for` loop because `?` doesn't compose with the previous closure form.
```rust
let mut row_offset: usize = 0;
for row in
self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
let offset = row_offset;
for x in row.iter_mut() {
*x = x.checked_add(offset).ok_or_else(|| {
ArrowError::CsvError(
"CSV record offsets overflowed usize while
flushing".to_string(),
)
})?;
row_offset = *x;
}
}
```
## Repro
The cargo-fuzz `csv_reader` harness from
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
(per #5332) reproduces this from an empty corpus in single-digit
minutes. The minimized repro is 72 bytes:
```
0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1|
0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.|
0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....|
0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.|
0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i|
```
Before this PR (run on `main` HEAD against the cargo-fuzz harness):
```
thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
```
After this PR the same 72 bytes pass through the fuzz target in 40 ms
with exit 0; the API now returns `ArrowError::CsvError(...)` for callers
to handle.
## Tests
Adds
`reader::records::tests::test_flush_offset_overflow_does_not_panic`,
which feeds the 72-byte fuzz repro through `RecordDecoder::decode` +
`flush` and asserts the loop terminates cleanly instead of panicking.
The existing 4 tests in that module continue to pass.
## Alternatives considered
- **Cap by `self.data_len`**: each emitted offset is supposed to be ≤
`self.data_len`, so an explicit cap would also turn the overflow into a
clean error. I went with `checked_add` because it's the more targeted
change — it doesn't add a new invariant on `csv_core`'s output, only
refuses to compute something that would have been arithmetically
nonsensical anyway.
- **Use `saturating_add`**: would silently truncate the offset and then
mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8
data` error or panic deeper in the call stack. Worse signal.
xref #5332 #9883 #9884
---------
Co-authored-by: masumi ryugo
<[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
---
arrow-csv/src/reader/records.rs | 36 ++++++++++++++++++++++++++++++------
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/arrow-csv/src/reader/records.rs b/arrow-csv/src/reader/records.rs
index 33927c9336..a6d54f867c 100644
--- a/arrow-csv/src/reader/records.rs
+++ b/arrow-csv/src/reader/records.rs
@@ -198,16 +198,21 @@ impl RecordDecoder {
// csv_core::Reader writes end offsets relative to the start of the row
// Therefore scan through and offset these based on the cumulative row
offsets
- let mut row_offset = 0;
+ let mut row_offset: usize = 0;
self.offsets[1..self.offsets_len]
.chunks_exact_mut(self.num_columns)
- .for_each(|row| {
+ .try_for_each(|row| -> Result<(), ArrowError> {
let offset = row_offset;
- row.iter_mut().for_each(|x| {
- *x += offset;
+ row.iter_mut().try_for_each(|x| -> Result<(), ArrowError> {
+ *x = x.checked_add(offset).ok_or_else(|| {
+ ArrowError::CsvError(
+ "CSV record offsets overflowed usize while
flushing".to_string(),
+ )
+ })?;
row_offset = *x;
- });
- });
+ Ok(())
+ })
+ })?;
// Need to truncate data t1o the actual amount of data read
let data =
std::str::from_utf8(&self.data[..self.data_len]).map_err(|e| {
@@ -399,4 +404,23 @@ mod tests {
assert_eq!(read, 5);
assert_eq!(bytes, csv.len());
}
+
+ /// Regression test for an overflow path found by the `arrow-csv`
+ /// cargo-fuzz harness being prototyped for #5332. Stages the
+ /// `RecordDecoder` state directly so that rebasing the second row's
+ /// end offset overflows `usize`. With the previous `*x += offset` this
+ /// panicked with `attempt to add with overflow`; the patched code
+ /// surfaces the condition as `ArrowError::CsvError`.
+ #[test]
+ fn test_flush_offset_overflow_returns_csv_error() {
+ let mut decoder = RecordDecoder::new(Reader::new(), 1, false);
+ decoder.offsets = vec![0, usize::MAX, 1];
+ decoder.offsets_len = 3;
+ decoder.num_rows = 2;
+ let err = decoder.flush().unwrap_err();
+ assert_eq!(
+ err.to_string(),
+ "Csv error: CSV record offsets overflowed usize while flushing"
+ );
+ }
}