BoazC-MSFT opened a new issue, #9992:
URL: https://github.com/apache/arrow-rs/issues/9992
**Describe the bug**
When reading a Parquet file whose row group metadata reports more rows than
a column chunk actually contains, the record reader (`RowIter` /
`get_row_iter`) panics instead of returning an error. The panic message is
something like `index out of bounds: the len is 0 but the index is 70093` at
`parquet/src/record/triplet.rs` inside `TypedTripletIter::current_def_level`.
This happens because `TypedTripletIter::read_next` clears the internal
`def_levels`, `rep_levels`, and `values` buffers when the underlying column
reader is exhausted, but returns `Ok(false)` without resetting
`curr_triplet_index` to 0. The higher-level `Reader::read_field` then calls
`current_def_level()` unconditionally (in `OptionReader`, `RepeatedReader`, and
`KeyValueReader` variants), which indexes into an empty vector with the stale
index from the previous batch.
`ReaderIter` trusts `row_group.metadata().num_rows()` to drive iteration
without cross-checking whether the leaf column readers have actually been
exhausted, so a mismatch between metadata and actual data triggers the panic.
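The stale-index mechanism described above can be sketched in isolation. This is a simplified model, not the parquet crate's code: only `def_levels`, `curr_triplet_index`, `read_next`, and `current_def_level` are names taken from this report, while `TripletModel`, `pending`, and `try_current_def_level` are hypothetical names introduced for illustration. It assumes every batch is non-empty.

```rust
use std::collections::VecDeque;

/// Simplified stand-in for the triplet iterator described in this issue.
/// NOT the parquet crate's implementation; `pending` is a hypothetical
/// queue of batches that replaces the real column reader.
pub struct TripletModel {
    def_levels: Vec<i16>,
    curr_triplet_index: usize,
    pending: VecDeque<Vec<i16>>,
}

impl TripletModel {
    pub fn new(batches: Vec<Vec<i16>>) -> Self {
        Self {
            def_levels: Vec::new(),
            curr_triplet_index: 0,
            pending: batches.into(),
        }
    }

    /// Advances to the next triplet. On exhaustion the buffer is cleared
    /// and `false` is returned, but `curr_triplet_index` keeps its old
    /// value, which is the behavior this issue reports.
    pub fn read_next(&mut self) -> bool {
        self.curr_triplet_index += 1;
        if self.curr_triplet_index < self.def_levels.len() {
            return true;
        }
        match self.pending.pop_front() {
            Some(batch) => {
                self.def_levels = batch;
                self.curr_triplet_index = 0;
                true
            }
            None => {
                self.def_levels.clear();
                false // index is left stale here
            }
        }
    }

    /// Unchecked access: panics when the index is stale and the buffer is empty.
    pub fn current_def_level(&self) -> i16 {
        self.def_levels[self.curr_triplet_index]
    }

    /// Hypothetical checked variant: reports an error instead of panicking.
    pub fn try_current_def_level(&self) -> Result<i16, String> {
        self.def_levels
            .get(self.curr_triplet_index)
            .copied()
            .ok_or_else(|| "Unexpected end of column data".to_string())
    }
}
```

With a single batch of three levels, the fourth `read_next` call returns `false`, after which `current_def_level()` indexes an empty vector while `try_current_def_level()` returns `Err`. Whether the actual fix should use a checked accessor, reset the index, or check `has_next()` in the callers is left open here.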
**To Reproduce**
Construct a `ReaderIter` via `TreeBuilder` with `num_records` set to one
more than the actual number of values in the column, and iterate to completion.
For example using `nulls.snappy.parquet` (8 rows):
```rust
let reader = TreeBuilder::new().build(descr, &*row_group_reader).unwrap();
let iter = ReaderIter::new(reader, 9).unwrap(); // actual data has 8 rows
for row in iter {
    let _ = row.unwrap(); // panics on the 9th iteration
}
```
In production this is triggered by third-party Parquet files where the row
group footer declares a `num_rows` value larger than the actual encoded column
data.
**Expected behavior**
The iterator should return `Err` with a descriptive message like "Unexpected
end of column data" instead of panicking.
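As a rough illustration of the expected behavior, the sketch below drives iteration from a declared row count but cross-checks it against the data actually available. `CheckedRowIter` and its fields are hypothetical and stand in for `ReaderIter`; a plain `Vec<i32>` replaces the real column readers, so this shows only the error-reporting shape, not a proposed patch.

```rust
/// Hypothetical iterator that trusts a declared row count only as an upper
/// bound and yields an error if the underlying data ends early.
pub struct CheckedRowIter {
    values: std::vec::IntoIter<i32>,
    declared_rows: usize,
    emitted: usize,
}

impl CheckedRowIter {
    pub fn new(values: Vec<i32>, declared_rows: usize) -> Self {
        Self {
            values: values.into_iter(),
            declared_rows,
            emitted: 0,
        }
    }
}

impl Iterator for CheckedRowIter {
    type Item = Result<i32, String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.emitted >= self.declared_rows {
            return None;
        }
        self.emitted += 1;
        match self.values.next() {
            Some(v) => Some(Ok(v)),
            None => {
                // Data ran out before the declared count was reached:
                // report it once, then end iteration.
                self.emitted = self.declared_rows;
                Some(Err("Unexpected end of column data".to_string()))
            }
        }
    }
}
```

With 8 values and a declared count of 9, the iterator yields eight `Ok` items followed by a single `Err`, mirroring the `nulls.snappy.parquet` scenario above without panicking.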
**Additional context**
The bug affects all Reader variants that call `current_def_level()` before
checking `has_next()`: `OptionReader`, `GroupReader` (for optional children),
`RepeatedReader`, and `KeyValueReader`. The `PrimitiveReader` variant is also
affected for required columns where `current_value()` indexes into the empty
`values` buffer.