(arrow-rs) branch main updated: fix(arrow-csv): bound RecordDecoder::flush offset accumulation (#9886)

alamb Wed, 06 May 2026 08:07:36 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git



The following commit(s) were added to refs/heads/main by this push:
     new ded985c95e fix(arrow-csv): bound RecordDecoder::flush offset 
accumulation (#9886)
ded985c95e is described below

commit ded985c95e6d132567710319d21e1901973ea16f
Author: masumi-ryugo <[email protected]>
AuthorDate: Thu May 7 00:06:53 2026 +0900

    fix(arrow-csv): bound RecordDecoder::flush offset accumulation (#9886)
    
    Closes #9885.
    
    ## What
    
    `RecordDecoder::flush` walks the per-row offsets emitted by
    `csv_core::Reader` and accumulates them so each end offset is absolute
    over `self.data` after the loop. The accumulator was a plain `usize` and
    the loop body did `*x += offset`, which on malformed input that drives
    `csv_core` to emit row-relative offsets large enough to wrap a `usize`:
    
    - panics with `attempt to add with overflow` in debug builds (and the
    cargo-fuzz `csv_reader` harness that found this is built with
    `--debug-assertions`);
    - silently wraps to a wildly out-of-bounds index in release builds,
    which then trips an unrelated `assert!` / `unwrap` somewhere downstream.
    
    ## Fix
    
    Switch the accumulator to `checked_add` and surface the overflow as
    `ArrowError::CsvError` instead. The body of the loop becomes a normal
    `for` loop because `?` doesn't compose with the previous closure form.
    
    ```rust
    let mut row_offset: usize = 0;
    for row in 
self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
        let offset = row_offset;
        for x in row.iter_mut() {
            *x = x.checked_add(offset).ok_or_else(|| {
                ArrowError::CsvError(
                    "CSV record offsets overflowed usize while 
flushing".to_string(),
                )
            })?;
            row_offset = *x;
        }
    }
    ```
    
    ## Repro
    
    The cargo-fuzz `csv_reader` harness from
    
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
    (per #5332) reproduces this from an empty corpus in single-digit
    minutes. The minimized repro is 72 bytes:
    
    ```
    0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
    0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
    0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
    0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
    0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|
    ```
    
    Before this PR (run on `main` HEAD against the cargo-fuzz harness):
    ```
    thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
    attempt to add with overflow
    ```
    
    After this PR the same 72 bytes pass through the fuzz target in 40 ms
    with exit 0; the API now returns `ArrowError::CsvError(...)` for callers
    to handle.
    
    ## Tests
    
    Adds
    `reader::records::tests::test_flush_offset_overflow_does_not_panic`,
    which feeds the 72-byte fuzz repro through `RecordDecoder::decode` +
    `flush` and asserts the loop terminates cleanly instead of panicking.
    The existing 4 tests in that module continue to pass.
    
    ## Alternatives considered
    
    - **Cap by `self.data_len`**: each emitted offset is supposed to be ≤
    `self.data_len`, so an explicit cap would also turn the overflow into a
    clean error. I went with `checked_add` because it's the more targeted
    change — it doesn't add a new invariant on `csv_core`'s output, only
    refuses to compute something that would have been arithmetically
    nonsensical anyway.
    - **Use `saturating_add`**: would silently truncate the offset and then
    mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8
    data` error or panic deeper in the call stack. Worse signal.
    
    xref #5332 #9883 #9884
    
    ---------
    
    Co-authored-by: masumi ryugo 
<[email protected]>
    Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
---
 arrow-csv/src/reader/records.rs | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/arrow-csv/src/reader/records.rs b/arrow-csv/src/reader/records.rs
index 33927c9336..a6d54f867c 100644
--- a/arrow-csv/src/reader/records.rs
+++ b/arrow-csv/src/reader/records.rs
@@ -198,16 +198,21 @@ impl RecordDecoder {
 
         // csv_core::Reader writes end offsets relative to the start of the row
         // Therefore scan through and offset these based on the cumulative row 
offsets
-        let mut row_offset = 0;
+        let mut row_offset: usize = 0;
         self.offsets[1..self.offsets_len]
             .chunks_exact_mut(self.num_columns)
-            .for_each(|row| {
+            .try_for_each(|row| -> Result<(), ArrowError> {
                 let offset = row_offset;
-                row.iter_mut().for_each(|x| {
-                    *x += offset;
+                row.iter_mut().try_for_each(|x| -> Result<(), ArrowError> {
+                    *x = x.checked_add(offset).ok_or_else(|| {
+                        ArrowError::CsvError(
+                            "CSV record offsets overflowed usize while 
flushing".to_string(),
+                        )
+                    })?;
                     row_offset = *x;
-                });
-            });
+                    Ok(())
+                })
+            })?;
 
         // Need to truncate data t1o the actual amount of data read
         let data = 
std::str::from_utf8(&self.data[..self.data_len]).map_err(|e| {
@@ -399,4 +404,23 @@ mod tests {
         assert_eq!(read, 5);
         assert_eq!(bytes, csv.len());
     }
+
+    /// Regression test for an overflow path found by the `arrow-csv`
+    /// cargo-fuzz harness being prototyped for #5332. Stages the
+    /// `RecordDecoder` state directly so that rebasing the second row's
+    /// end offset overflows `usize`. With the previous `*x += offset` this
+    /// panicked with `attempt to add with overflow`; the patched code
+    /// surfaces the condition as `ArrowError::CsvError`.
+    #[test]
+    fn test_flush_offset_overflow_returns_csv_error() {
+        let mut decoder = RecordDecoder::new(Reader::new(), 1, false);
+        decoder.offsets = vec![0, usize::MAX, 1];
+        decoder.offsets_len = 3;
+        decoder.num_rows = 2;
+        let err = decoder.flush().unwrap_err();
+        assert_eq!(
+            err.to_string(),
+            "Csv error: CSV record offsets overflowed usize while flushing"
+        );
+    }
 }

(arrow-rs) branch main updated: fix(arrow-csv): bound RecordDecoder::flush offset accumulation (#9886)

Reply via email to