jonded94 commented on issue #9370: URL: https://github.com/apache/arrow-rs/issues/9370#issuecomment-3861327615
I have to go now, but this is what Claude spat out. Gonna open a PR later
● The test passes. Let me update the todo list and provide a summary of the
fix.
● The fix works. Here's a summary:
Bug Summary
The issue was in the skip_records function
(/home/anrs/workspace/arrow-rs/parquet/src/column/reader.rs:287-382).
Root cause: When skip_records completed successfully after crossing page
boundaries, the has_partial state in the RepetitionLevelDecoder could
incorrectly remain true. This happened when:
1. The skip operation exhausted a page where has_record_delimiter was
false
2. The skip found the remaining records on the next page by counting a
delimiter at index 0
When read_records(1) was subsequently called, the stale has_partial=true
state caused count_records to treat the repetition level 0 at index 0 as the
end of a "phantom" partial record. It returned (1 record, 0 levels, 0 values)
instead of reading the actual record data.
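
To make the phantom record concrete, here is a minimal toy model of the counting rule described above. ToyRepDecoder, its count_records and read_one are hypothetical names for illustration only; this is a sketch under the assumption that a repetition level of 0 closes the previous record, not the arrow-rs implementation.

```rust
/// Toy stand-in for a repetition-level decoder (hypothetical, not the arrow-rs type).
struct ToyRepDecoder {
    /// True when a record started earlier has not been closed yet.
    has_partial: bool,
}

impl ToyRepDecoder {
    /// Counts up to `max_records` completed records in `levels`.
    /// Returns (records_counted, levels_consumed).
    fn count_records(&mut self, levels: &[i16], max_records: usize) -> (usize, usize) {
        let mut records = 0;
        for (idx, &level) in levels.iter().enumerate() {
            // Level 0 starts a new record and thereby closes the previous one.
            // At index 0 there is only something to close if a record was left open.
            if level == 0 && (idx != 0 || self.has_partial) {
                records += 1;
                if records == max_records {
                    self.has_partial = false;
                    return (records, idx);
                }
            }
        }
        // Buffer exhausted: the last record seen is still open.
        self.has_partial = self.has_partial || !levels.is_empty();
        (records, levels.len())
    }
}

/// Reads one record from `levels`, starting with the given flag value.
fn read_one(levels: &[i16], has_partial: bool) -> (usize, usize) {
    let mut decoder = ToyRepDecoder { has_partial };
    decoder.count_records(levels, 1)
}
```

With levels [0, 1, 1, 0, 1], read_one(levels, false) consumes the three levels of the first record, while read_one(levels, true) reports one record after consuming zero levels, which is the phantom record from the summary.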
The Fix
Added code at the end of skip_records to reset the partial record state
when all requested records have been successfully skipped:
    // After successfully skipping all requested records, we are positioned
    // at a record boundary. Reset any partial record state to ensure
    // subsequent read operations start fresh. This handles edge cases where
    // the skip operation crossed page boundaries and the has_partial state
    // was not properly reset.
    if remaining_records == 0 {
        if let Some(decoder) = self.rep_level_decoder.as_mut() {
            decoder.flush_partial();
        }
    }
This ensures that after skip_records completes, we're at a clean record
boundary with no lingering partial record state, fixing the array length
mismatch in StructArrayReader.
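
The skip-then-read sequence can be sketched with a toy reader over two pages. Everything here (ToyReader, scan, skip_then_read, the reset_at_boundary switch) is hypothetical, and the toy scan deliberately leaves has_partial set when the closing delimiter is found at index 0 of the next page, so that the effect of the reset is visible. The real decoder state handling in arrow-rs may differ, and the toy clears the flag directly instead of calling flush_partial.

```rust
/// Hypothetical toy reader over pages of repetition levels (not the arrow-rs reader).
struct ToyReader {
    pages: Vec<Vec<i16>>,
    page: usize,       // index of the current page
    pos: usize,        // offset of the next unread level in that page
    has_partial: bool, // true while a record is open across a page boundary
}

impl ToyReader {
    fn new(pages: Vec<Vec<i16>>) -> Self {
        ToyReader { pages, page: 0, pos: 0, has_partial: false }
    }

    /// Advances until `n` records are closed; returns (records, levels_consumed).
    /// Deliberately never clears `has_partial`, to mimic the stale flag.
    fn scan(&mut self, n: usize) -> (usize, usize) {
        let (mut records, mut consumed) = (0, 0);
        while self.page < self.pages.len() {
            let levels = &self.pages[self.page][self.pos..];
            let mut stop = None;
            for (idx, &level) in levels.iter().enumerate() {
                // Level 0 closes the previous record, if there is one.
                if level == 0 && (idx != 0 || self.has_partial) {
                    records += 1;
                    if records == n {
                        stop = Some(idx);
                        break;
                    }
                }
            }
            match stop {
                Some(idx) => {
                    self.pos += idx;
                    return (records, consumed + idx);
                }
                None => {
                    // Page exhausted with a record still open.
                    consumed += levels.len();
                    self.has_partial = self.has_partial || !levels.is_empty();
                    self.page += 1;
                    self.pos = 0;
                }
            }
        }
        (records, consumed)
    }

    fn skip_records(&mut self, n: usize, reset_at_boundary: bool) -> usize {
        let (skipped, _) = self.scan(n);
        // The reset under discussion: all records skipped means we sit on a boundary.
        if reset_at_boundary && skipped == n {
            self.has_partial = false;
        }
        skipped
    }

    fn read_records(&mut self, n: usize) -> (usize, usize) {
        self.scan(n)
    }
}

/// Skips two records across a page boundary, then reads one record.
fn skip_then_read(reset_at_boundary: bool) -> (usize, usize) {
    let mut reader = ToyReader::new(vec![vec![0, 1, 0, 1], vec![0, 1, 0, 1]]);
    let skipped = reader.skip_records(2, reset_at_boundary);
    assert_eq!(skipped, 2);
    reader.read_records(1)
}
```

Without the reset, skip_then_read(false) reports one record with zero levels consumed. With the reset, skip_then_read(true) reads the two levels of the next real record.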
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
