youichi-uda opened a new issue, #9885:
URL: https://github.com/apache/arrow-rs/issues/9885
# arrow-csv: integer overflow panic in `Reader::records::flush`
The cargo-fuzz `csv_reader` harness being prototyped for #5332 found a clean
panic in `arrow-csv` after a few minutes of running on `main` HEAD. Filing here
because it's an actual bug surfaced by the proposed fuzzing infrastructure.
## Repro
72 bytes:
```
0000 2e 22 3f 0a 31 0a 3f 3f 0a 3c 50 50 0a 3f 0a 31 |."?.1.??.<PP.?.1|
0010 0a 3f 38 0a 3c 0a 3f 0a 3c 50 50 0a 3f 0a 31 0a |.?8.<.?.<PP.?.1.|
0020 3f 38 0a 0a 2e 22 3f 0a 31 0a 3f 3f 0a ce ce ce |?8..."?.1.??....|
0030 b1 ce ce ce ce ce ce ce ce 31 0a 3f 38 0a 3c 0a |.........1.?8.<.|
0040 3f 0a 3c 0a 3f 0a 3f 69 |?.<.?.?i|
```
(The harness feeds this input straight into
`ReaderBuilder::new(Arc::new(schema)).with_format(format).build(Cursor::new(data))`
and iterates the resulting batches; the panic fires inside the reader's flush,
not in client code. A minimal driver is sketched below.)
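For convenience, a standalone driver along those lines (a sketch: the single-`Utf8`-column schema and headerless `Format` below are assumptions, since the fuzz harness derives its own; any configuration that reaches the records flush should reproduce):

```rust
// Sketch of a standalone repro driver (the real harness lives in the fuzz
// crate on the fork). The single-Utf8-column schema and headerless Format
// are assumptions, not necessarily what the harness uses.
use std::io::Cursor;
use std::sync::Arc;

use arrow_csv::reader::Format;
use arrow_csv::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // The 72-byte crash input from the hex dump above.
    let data: &[u8] = &[
        0x2e, 0x22, 0x3f, 0x0a, 0x31, 0x0a, 0x3f, 0x3f, 0x0a, 0x3c, 0x50, 0x50, 0x0a, 0x3f, 0x0a, 0x31,
        0x0a, 0x3f, 0x38, 0x0a, 0x3c, 0x0a, 0x3f, 0x0a, 0x3c, 0x50, 0x50, 0x0a, 0x3f, 0x0a, 0x31, 0x0a,
        0x3f, 0x38, 0x0a, 0x0a, 0x2e, 0x22, 0x3f, 0x0a, 0x31, 0x0a, 0x3f, 0x3f, 0x0a, 0xce, 0xce, 0xce,
        0xb1, 0xce, 0xce, 0xce, 0xce, 0xce, 0xce, 0xce, 0xce, 0x31, 0x0a, 0x3f, 0x38, 0x0a, 0x3c, 0x0a,
        0x3f, 0x0a, 0x3c, 0x0a, 0x3f, 0x0a, 0x3f, 0x69,
    ];

    let schema = Schema::new(vec![Field::new("c0", DataType::Utf8, true)]);
    let format = Format::default().with_header(false);

    let reader = ReaderBuilder::new(Arc::new(schema))
        .with_format(format)
        .build(Cursor::new(data))
        .expect("building the reader succeeds; the panic comes later");

    // Iterating drives Reader::next(), which flushes decoded records and panics.
    for batch in reader {
        let _ = batch;
    }
}
```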
## Panic
```
thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow
…
3: <closure> at arrow-csv/src/reader/records.rs:207:21
4: for_each over chunks_exact_mut row offsets
5: <closure> at arrow-csv/src/reader/records.rs:206:32
6: …
9: arrow_csv::reader::records::*::flush
at arrow-csv/src/reader/records.rs:204:14
```
The panicking site is the `row.iter_mut().for_each(|x| { *x += offset; row_offset =
*x; })` accumulation:
```rust
// arrow-csv/src/reader/records.rs:201-210 (approx)
let mut row_offset = 0;
self.offsets[1..self.offsets_len]
.chunks_exact_mut(self.num_columns)
.for_each(|row| {
let offset = row_offset;
row.iter_mut().for_each(|x| {
*x += offset; // <-- panic here on overflow
row_offset = *x;
});
});
```
`*x` and `offset` are both `usize`. With pathological CSV input the accumulated
per-row offsets grow without bound until `*x + offset` exceeds `usize::MAX`:
with overflow checks enabled (as in the fuzz build) the addition panics, and
without them it wraps silently and hands corrupt offsets downstream. The crash
is reachable from any caller of `Reader::next()` on attacker-controlled bytes,
so it's a DoS surface for any service that accepts CSV through `arrow-csv`
directly.
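For reference, a standalone model of what that loop computes (illustrative only, not arrow-csv code): each row's field-end offsets start relative to the row and are rewritten in place to absolute positions, so the running value only ever grows.

```rust
// Standalone model of the accumulation loop above (illustrative only).
// Each row's field-end offsets start relative to that row and are rewritten
// to absolute positions; the running total never shrinks, which is why a
// plain `+=` can eventually overflow on adversarial input.
fn main() {
    let num_columns = 2;
    // Two rows of two fields; fields end at bytes 3/5 and 2/4 within their rows.
    let mut offsets = vec![3usize, 5, 2, 4];

    let mut row_offset = 0usize;
    offsets.chunks_exact_mut(num_columns).for_each(|row| {
        let offset = row_offset;
        row.iter_mut().for_each(|x| {
            *x += offset; // the `attempt to add with overflow` site in records.rs
            row_offset = *x;
        });
    });

    // Offsets are now absolute into the concatenated data buffer.
    assert_eq!(offsets, [3, 5, 7, 9]);
}
```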
## Suggested fix sketch
Either:
- **Bound the cumulative offset by the actual data length**: each emitted
offset can never legitimately exceed `self.data_len`, so an explicit cap prior
to the addition turns the overflow into a clean `ParseError`.
- **Use `checked_add` on the accumulator** and surface the overflow as a
`ParseError("CSV record offsets overflowed")` instead of a panic. In release
builds, where overflow checks are typically disabled, the unchecked addition
would otherwise wrap silently and hand corrupt offsets downstream. (Sketch
below.)
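A rough sketch of the `checked_add` variant, assuming the enclosing flush already returns a `Result<_, ArrowError>` so `?` can propagate (field names are taken from the snippet above; the error text is a placeholder):

```rust
// Sketch only: plain for-loops instead of for_each so `?` can propagate.
// ArrowError is arrow_schema::ArrowError; the message text is a placeholder.
let mut row_offset = 0usize;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        *x = (*x).checked_add(offset).ok_or_else(|| {
            ArrowError::ParseError("CSV record offsets overflowed".to_string())
        })?;
        row_offset = *x;
    }
}
```

Option 1 (capping against `self.data_len`) could plug into the same spot by comparing the new value against the data length and returning the same `ParseError`.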
I haven't sent a PR for this since I have two related parquet
allocation-bound PRs (#9883, #9884) already in flight and didn't want to flood
the queue. Happy to send a fix as a follow-up once those land if no one else
picks it up first.
The 72-byte repro and the `arrow-csv/fuzz` harness that produced it live on
my fork at
[`fuzz/initial-harnesses`](https://github.com/masumi-ryugo/arrow-rs/tree/fuzz/initial-harnesses)
(per #5332). The harness reproduces this in single-digit minutes from an empty
corpus.
xref #5332 #9883 #9884