youichi-uda opened a new issue, #9885:
URL: https://github.com/apache/arrow-rs/issues/9885

   # arrow-csv: integer overflow panic in `Reader::records::flush`
   
   The cargo-fuzz `csv_reader` harness being prototyped for #5332 found a clean 
panic in `arrow-csv` after a few minutes of running on `main` HEAD. Filing here 
because it's an actual bug surfaced by the proposed fuzzing infrastructure.
   
   ## Repro
   
   72 bytes:
   
   ```
   0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
   0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
   0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
   0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
   0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|
   ```
   
   (The same input fed straight into 
`ReaderBuilder::new(Arc::new(schema)).with_format(format).build(Cursor::new(data))`,
 iterating the resulting batches, triggers the panic inside the reader's flush, 
not in client code.)
   
   ## Panic
   
   ```
   thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
   attempt to add with overflow
      …
      3: <closure> at arrow-csv/src/reader/records.rs:207:21
      4: for_each over chunks_exact_mut row offsets
      5: <closure> at arrow-csv/src/reader/records.rs:206:32
      6: …
      9: arrow_csv::reader::records::*::flush
                 at arrow-csv/src/reader/records.rs:204:14
   ```
   
   The site is the `row.iter_mut().for_each(|x| { *x += offset; row_offset = 
*x; })` accumulation:
   
   ```rust
   // arrow-csv/src/reader/records.rs:201-210 (approx)
   let mut row_offset = 0;
   self.offsets[1..self.offsets_len]
       .chunks_exact_mut(self.num_columns)
       .for_each(|row| {
           let offset = row_offset;
           row.iter_mut().for_each(|x| {
               *x += offset;          // <-- panic here on overflow
               row_offset = *x;
           });
       });
   ```
   
   `*x` and `offset` are both `usize`. With pathological CSV input the 
accumulated per-row offsets grow without bound, and `*x + offset` eventually 
exceeds `usize::MAX`. The crash is reachable from any caller of `Reader::next()` 
on attacker-controlled bytes, so it's a DoS surface for any service that accepts 
CSV through `arrow-csv` directly.
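
   To make the wrap concrete, here is a stdlib-only sketch of the same 
accumulation pattern on synthetic offset values (made up for illustration, not 
taken from the repro input); `overflowing_add` makes the wrap observable in any 
build mode:

   ```rust
   fn main() {
       // Synthetic per-row offsets: the running offset entering the second row
       // is already near usize::MAX, mimicking the unbounded growth the fuzzer
       // provokes with pathological CSV.
       let offsets: Vec<usize> = vec![1, usize::MAX - 1, 2, 2];
       let num_columns = 2;
       let mut row_offset = 0usize;
       let mut overflowed = false;
       for row in offsets.chunks_exact(num_columns) {
           let offset = row_offset;
           for &x in row {
               // The reader does `*x += offset`: with overflow checks on this
               // panics ("attempt to add with overflow"); with them off it
               // silently wraps.
               let (sum, wrapped) = x.overflowing_add(offset);
               overflowed |= wrapped;
               row_offset = sum;
           }
       }
       assert!(overflowed); // 2 + (usize::MAX - 1) does not fit in a usize
   }
   ```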
   
   ## Suggested fix sketch
   
   Either:
   
   - **Bound the cumulative offset by the actual data length**: no emitted 
offset can legitimately exceed `self.data_len`, so an explicit cap before the 
addition turns the overflow into a clean `ParseError`.
   - **Use `checked_add` on the accumulator** and surface the overflow as a 
`ParseError("CSV record offsets overflowed")`. With overflow checks on (debug 
builds, fuzzing) the current code panics; in a default release build the 
addition instead wraps silently. That wrap is defined behavior in Rust, not UB, 
but it yields garbage offsets downstream, which is arguably worse than the 
panic.
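
   The `checked_add` variant could look roughly like this (a sketch, not a PR: 
`flush_offsets` and the `Result<(), String>` error plumbing are placeholders 
for the real `flush` signature):

   ```rust
   /// Hypothetical stand-in for the accumulation loop in `flush`: identical
   /// arithmetic, but `checked_add` converts overflow into a recoverable error.
   fn flush_offsets(offsets: &mut [usize], num_columns: usize) -> Result<(), String> {
       let mut row_offset = 0usize;
       for row in offsets.chunks_exact_mut(num_columns) {
           let offset = row_offset;
           for x in row.iter_mut() {
               *x = x
                   .checked_add(offset)
                   .ok_or_else(|| "CSV record offsets overflowed".to_string())?;
               row_offset = *x;
           }
       }
       Ok(())
   }

   fn main() {
       // Well-formed offsets accumulate exactly as before.
       let mut ok = vec![1, 2, 3, 4];
       assert!(flush_offsets(&mut ok, 2).is_ok());
       assert_eq!(ok, vec![1, 2, 5, 6]);

       // Pathological offsets now yield an error instead of a panic or wrap.
       let mut bad = vec![1, usize::MAX - 1, 2, 2];
       assert!(flush_offsets(&mut bad, 2).is_err());
   }
   ```

   The `data_len` cap from the first bullet could be folded into the same 
`ok_or_else`, since any offset past the end of the buffer is equally invalid.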
   
   I haven't sent a PR for this since I have two related parquet 
allocation-bound PRs (#9883, #9884) already in flight and didn't want to flood 
the queue. Happy to send a fix as a follow-up once those land if no one else 
picks it up first.
   
   The 72-byte repro and the `arrow-csv/fuzz` harness that produced it live on 
my fork at 
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
 (per #5332). The harness reproduces this in single-digit minutes from an empty 
corpus.
   
   xref #5332 #9883 #9884
   

