masumi-ryugo opened a new pull request, #9886:
URL: https://github.com/apache/arrow-rs/pull/9886

   Closes #9885.
   
   ## What
   
   `RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so that, after the loop, each end offset is absolute over `self.data`. The accumulator was a plain `usize` and the loop body did `*x += offset`. On malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`, that addition (see the sketch after this list):

   - panics with `attempt to add with overflow` in debug builds (the cargo-fuzz `csv_reader` harness that found this is built with `--debug-assertions`);
   - silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated `assert!` or `unwrap` somewhere downstream.
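
   A minimal, self-contained sketch of the pre-fix pattern (the names and the synthetic offset value are illustrative, not the exact `records.rs` source):

   ```rust
   /// Rebase per-row relative offsets to absolute offsets the way the
   /// pre-fix `flush` loop did: offsets within a row are relative to the row
   /// start, and a row's base is the absolute end offset of the previous row.
   fn rebase_offsets(offsets: &mut [usize], num_columns: usize) {
       let mut row_offset: usize = 0;
       for row in offsets.chunks_exact_mut(num_columns) {
           let base = row_offset;
           for x in row.iter_mut() {
               *x += base; // plain `+=`: panics in debug builds, wraps in release
               row_offset = *x;
           }
       }
   }

   fn main() {
       // A relative offset near `usize::MAX` stands in for what malformed
       // input can drive `csv_core` to emit; the second row's add overflows.
       let mut offsets = vec![usize::MAX - 1, 4];
       rebase_offsets(&mut offsets, 1); // debug: "attempt to add with overflow"
   }
   ```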
   
   ## Fix
   
   Switch the accumulation to `checked_add` and surface the overflow as an `ArrowError::CsvError` instead. The loop becomes a plain `for` loop because `?` doesn't compose with the previous closure-based form.
   
   ```rust
   let mut row_offset: usize = 0;
   for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
       let offset = row_offset;
       for x in row.iter_mut() {
           *x = x.checked_add(offset).ok_or_else(|| {
               ArrowError::CsvError(
                   "CSV record offsets overflowed usize while flushing".to_string(),
               )
           })?;
           row_offset = *x;
       }
   }
   ```
   
   ## Repro
   
   The cargo-fuzz `csv_reader` harness from 
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
 (per #5332) reproduces this from an empty corpus in single-digit minutes. The 
minimized repro is 72 bytes:
   
   ```
   0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
   0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
   0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
   0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
   0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|
   ```
   
   Before this PR (run on `main` HEAD against the cargo-fuzz harness):
   ```
   thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
   attempt to add with overflow
   ```
   
   After this PR, the same 72 bytes pass through the fuzz target in 40 ms with exit code 0; the API now returns `ArrowError::CsvError(...)` for callers to handle.
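
   Caller-side handling would look roughly like this (illustrative fragment; `decoder` stands for a `RecordDecoder` that has already consumed the repro bytes via `decode`):

   ```rust
   match decoder.flush() {
       Ok(_records) => {
           // Normal path: hand the decoded records to the column parsers.
       }
       Err(ArrowError::CsvError(msg)) => {
           // New path for this input: a reportable error instead of a panic.
           eprintln!("malformed CSV: {msg}");
       }
       Err(other) => {
           // Unrelated Arrow errors propagate as before.
           eprintln!("error: {other}");
       }
   }
   ```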
   
   ## Tests
   
   Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`, which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush` and asserts that the call returns an error cleanly instead of panicking. The existing 4 tests in that module continue to pass.
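
   A sketch of the shape of that test (hedged: the `RecordDecoder::new` and `decode` signatures and the column count are assumed here; the byte array is the hexdump above, transcribed):

   ```rust
   #[test]
   fn test_flush_offset_overflow_does_not_panic() {
       // The 72-byte minimized fuzz repro from the PR description.
       const REPRO: &[u8] = &[
           0x2e, 0x22, 0x3f, 0x0a, 0x31, 0x0a, 0x3f, 0x3f, 0x0a, 0x3c, 0x50, 0x50,
           0x0a, 0x3f, 0x0a, 0x31, 0x0a, 0x3f, 0x38, 0x0a, 0x3c, 0x0a, 0x3f, 0x0a,
           0x3c, 0x50, 0x50, 0x0a, 0x3f, 0x0a, 0x31, 0x0a, 0x3f, 0x38, 0x0a, 0x0a,
           0x2e, 0x22, 0x3f, 0x0a, 0x31, 0x0a, 0x3f, 0x3f, 0x0a, 0xce, 0xce, 0xce,
           0xb1, 0xce, 0xce, 0xce, 0xce, 0xce, 0xce, 0xce, 0xce, 0x31, 0x0a, 0x3f,
           0x38, 0x0a, 0x3c, 0x0a, 0x3f, 0x0a, 0x3c, 0x0a, 0x3f, 0x0a, 0x3f, 0x69,
       ];
       // Constructor and `decode(input, to_read)` shapes assumed from the module.
       let mut decoder = RecordDecoder::new(csv_core::ReaderBuilder::new().build(), 2);
       let _ = decoder.decode(REPRO, usize::MAX);
       // Must surface ArrowError::CsvError from flush rather than panic.
       assert!(decoder.flush().is_err());
   }
   ```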
   
   ## Alternatives considered
   
   - **Cap by `self.data_len`**: each emitted offset is supposed to be ≤ `self.data_len`, so an explicit cap would also turn the overflow into a clean error (see the sketch after this list). I went with `checked_add` because it is the more targeted change: it doesn't impose a new invariant on `csv_core`'s output, it only refuses to compute a value that would have been arithmetically nonsensical anyway.
   - **Use `saturating_add`**: would silently clamp the offset and then mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data` error or a panic deeper in the call stack. Worse signal.
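
   For reference, the rejected cap would look roughly like this as a standalone helper (a sketch, not the in-tree code; the in-tree version would return `ArrowError::CsvError` rather than `String`):

   ```rust
   /// Sketch of the rejected alternative: rebase offsets with a wrapping add
   /// and enforce the `<= data_len` invariant, so both out-of-range offsets
   /// and arithmetic wraparound surface as the same clean error.
   fn rebase_offsets_capped(
       offsets: &mut [usize],
       num_columns: usize,
       data_len: usize,
   ) -> Result<(), String> {
       let mut row_offset: usize = 0;
       for row in offsets.chunks_exact_mut(num_columns) {
           let base = row_offset;
           for x in row.iter_mut() {
               let abs = x.wrapping_add(base);
               // `abs < base` detects wraparound; `abs > data_len` catches a
               // wrapped or simply out-of-range offset.
               if abs < base || abs > data_len {
                   return Err(format!(
                       "CSV record offset exceeds buffered data length {data_len}"
                   ));
               }
               *x = abs;
               row_offset = abs;
           }
       }
       Ok(())
   }
   ```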
   
   xref #5332 #9883 #9884
   

