BoazC-MSFT opened a new pull request, #9993:
URL: https://github.com/apache/arrow-rs/pull/9993

   fix: prevent panic in record reader when row group metadata overcounts 
num_rows
   
   # Which issue does this PR close?
   
   - Closes #9992.
   
   # Rationale for this change
   
   The record reader (`RowIter` / `get_row_iter`) panics with `index out of 
bounds` when a Parquet file's row group metadata declares more rows than a 
column chunk actually contains. This happens in production when reading 
third-party Parquet files with mismatched metadata. Instead of panicking, the 
reader should return an error.
   
   # What changes are included in this PR?
   
   Three layers of fix in `parquet/src/record/`:
   
   **triplet.rs - fix the inconsistent internal state:**
   - Reset `curr_triplet_index` to 0 in the exhaustion path of `read_next`, so 
the stale index from the previous batch never persists alongside empty buffers.
   - Return 0 from `current_def_level` and `current_rep_level` when `has_next` 
is false, as defense-in-depth against any caller that skips the `has_next` 
check.
   
   **reader.rs - return errors instead of panicking:**
   - Add `has_next()` guards before consuming column data in all `read_field` 
variants: `PrimitiveReader`, `OptionReader`, `RepeatedReader`, and 
`KeyValueReader`. When a column is exhausted mid-iteration, `read_field` now 
returns `Err("Unexpected end of column data")` which propagates through 
`ReaderIter::next` as `Some(Err(...))`.
   
   # Are these changes tested?
   
   Yes. Five new tests:
   
   - `test_current_def_level_safe_after_exhaustion` - drives a `TripletIter` to 
exhaustion on an optional column and asserts `current_def_level()` returns 0 
instead of panicking.
   - `test_current_rep_level_safe_after_exhaustion` - same for 
`current_rep_level()` on a repeated column.
   - `test_reader_iter_returns_error_when_num_records_exceeds_data` - exercises 
the full `ReaderIter` stack with an optional field (via `nulls.snappy.parquet`).
   - 
`test_reader_iter_returns_error_for_repeated_field_when_num_records_exceeds_data`
 - same for a repeated primitive field (via 
`repeated_primitive_no_list.parquet`).
   - 
`test_reader_iter_returns_error_for_map_field_when_num_records_exceeds_data` - 
same for a map field projected alone (via `map_no_value.parquet`).
   
   Each integration test inflates `num_records` by 1 beyond actual data, 
asserts all real rows return `Ok`, and asserts the extra iteration returns 
`Err` containing "Unexpected end of column data".
   
   # Are there any user-facing changes?
   
   Callers of `get_row_iter` or `RowIter` that previously hit a panic on 
corrupt/truncated files will now receive an `Err` from the iterator instead. No 
API signature changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to