BoazC-MSFT opened a new issue, #9992:
URL: https://github.com/apache/arrow-rs/issues/9992

   **Describe the bug**
   
   When reading a Parquet file whose row group metadata reports more rows than 
a column chunk actually contains, the record reader (`RowIter` / 
`get_row_iter`) panics instead of returning an error. The panic message looks 
like `index out of bounds: the len is 0 but the index is 70093`, raised in 
`parquet/src/record/triplet.rs` inside `TypedTripletIter::current_def_level`.
   
   This happens because `TypedTripletIter::read_next` clears the internal 
`def_levels`, `rep_levels`, and `values` buffers when the underlying column 
reader is exhausted, but returns `Ok(false)` without resetting 
`curr_triplet_index` to 0. The higher-level `Reader::read_field` then calls 
`current_def_level()` unconditionally (in `OptionReader`, `RepeatedReader`, and 
`KeyValueReader` variants), which indexes into an empty vector with the stale 
index from the previous batch.
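   
   The stale-index mechanism can be illustrated without the parquet crate. The 
sketch below is a simplified model, not the real `TypedTripletIter`: the 
`TripletModel` type, its `batches` field, and `current_def_level_checked` are 
hypothetical names invented for illustration, and only the clear-without-reset 
behaviour is taken from the description above.
   
   ```rust
   // Simplified stand-in for TypedTripletIter (hypothetical, not the parquet crate's type).
   struct TripletModel {
       def_levels: Vec<i16>,
       curr_triplet_index: usize,
       // Remaining batches; stands in for the underlying column reader.
       batches: Vec<Vec<i16>>,
   }

   impl TripletModel {
       // Mirrors the reported behaviour: on exhaustion the buffer is cleared,
       // but curr_triplet_index keeps the value it had in the previous batch.
       fn read_next(&mut self) -> bool {
           if self.curr_triplet_index + 1 < self.def_levels.len() {
               self.curr_triplet_index += 1;
               return true;
           }
           match self.batches.pop() {
               Some(batch) => {
                   self.def_levels = batch;
                   self.curr_triplet_index = 0;
                   true
               }
               None => {
                   self.def_levels.clear(); // index is NOT reset here
                   false
               }
           }
       }

       // Bounds-checked accessor; plain indexing with
       // `self.def_levels[self.curr_triplet_index]` would panic after exhaustion.
       fn current_def_level_checked(&self) -> Option<i16> {
           self.def_levels.get(self.curr_triplet_index).copied()
       }
   }

   fn main() {
       let mut it = TripletModel {
           def_levels: Vec::new(),
           curr_triplet_index: 0,
           batches: vec![vec![1, 0, 1]],
       };
       while it.read_next() {}
       // The buffer is empty but the index still points into the old batch.
       println!(
           "len = {}, index = {}, checked = {:?}",
           it.def_levels.len(),
           it.curr_triplet_index,
           it.current_def_level_checked()
       );
   }
   ```
   
   In this model, any caller that indexes `def_levels` directly after 
`read_next` has returned `false` hits the same kind of out-of-bounds panic as 
the one reported.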
   
   `ReaderIter` trusts `row_group.metadata().num_rows()` to drive iteration 
without cross-checking whether the leaf column readers have actually been 
exhausted, so a mismatch between metadata and actual data triggers the panic.
   
   **To Reproduce**
   
   Construct a `ReaderIter` via `TreeBuilder` with `num_records` set to one 
more than the actual number of values in the column, and iterate to completion. 
For example using `nulls.snappy.parquet` (8 rows):
   
   ```rust
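   // `descr` and `row_group_reader` come from the opened file (setup omitted).
   let reader = TreeBuilder::new().build(descr, &*row_group_reader).unwrap();
   // Declare 9 records although the actual data has only 8 rows.
   let iter = ReaderIter::new(reader, 9).unwrap();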
   for row in iter {
       let _ = row.unwrap(); // panics on the 9th iteration
   }
   ```
   
   In production this is triggered by third-party Parquet files where the row 
group footer declares a `num_rows` value larger than the actual encoded column 
data.
   
   **Expected behavior**
   
   The iterator should return `Err` with a descriptive message like "Unexpected 
end of column data" instead of panicking.
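   
   A minimal sketch of that expected behaviour, again using a simplified model 
rather than the real `ReaderIter`: `RowIterModel` and its fields are 
hypothetical, and the error is a plain `String` here where the crate would use 
its own error type. The iterator trusts a declared row count but surfaces an 
`Err` once the column runs dry.
   
   ```rust
   // Hypothetical model of a row iterator driven by a declared row count.
   struct RowIterModel {
       values: Vec<i32>, // the data actually present in the column
       pos: usize,       // next value to read
       remaining: usize, // rows still owed according to the metadata
   }

   impl RowIterModel {
       fn new(values: Vec<i32>, declared_rows: usize) -> Self {
           RowIterModel { values, pos: 0, remaining: declared_rows }
       }
   }

   impl Iterator for RowIterModel {
       type Item = Result<i32, String>;

       fn next(&mut self) -> Option<Self::Item> {
           if self.remaining == 0 {
               return None;
           }
           self.remaining -= 1;
           match self.values.get(self.pos) {
               Some(v) => {
                   self.pos += 1;
                   Some(Ok(*v))
               }
               None => {
                   // Report the mismatch once, then stop iterating.
                   self.remaining = 0;
                   Some(Err(format!(
                       "Unexpected end of column data: metadata declared more rows than the {} present",
                       self.values.len()
                   )))
               }
           }
       }
   }

   fn main() {
       // 8 real values, 9 declared rows, as in the reproduction above.
       for row in RowIterModel::new((1..=8).collect(), 9) {
           match row {
               Ok(v) => println!("row value {v}"),
               Err(e) => println!("error: {e}"),
           }
       }
   }
   ```
   
   Stopping after the first error is a design choice of this sketch, so that a 
caller who ignores the `Err` does not receive a stream of repeated failures.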
   
   **Additional context**
   
   The bug affects all Reader variants that call `current_def_level()` before 
checking `has_next()`: `OptionReader`, `GroupReader` (for optional children), 
`RepeatedReader`, and `KeyValueReader`. The `PrimitiveReader` variant is also 
affected for required columns where `current_value()` indexes into the empty 
`values` buffer.
   

