marekgalovic opened a new issue, #4921: URL: https://github.com/apache/arrow-rs/issues/4921
**Describe the bug**

I'm running into an issue where loading the page index in order to prune pages breaks page skipping for columns with nested types in [GenericColumnReader::skip_records](https://github.com/pinecone-io/arrow-rs/blob/pinecone-main/parquet/src/column/reader.rs#L331).

This method assumes that if the page metadata has `num_records` set, that value equals the number of records that should be skipped. That does not appear to hold for columns with list-typed values, because it does not account for repetition levels. Additionally, [SerializedPageReader::peek_next_page](https://github.com/pinecone-io/arrow-rs/blob/pinecone-main/parquet/src/file/serialized_reader.rs#L735) sets `num_rows` to a value based on `PageLocation`, which does not consider repetition levels for the column either.

As a result, `GenericColumnReader::skip_records` skips over more pages than it should, which causes the next read to fail with the [following error](https://github.com/pinecone-io/arrow-rs/blob/pinecone-main/parquet/src/arrow/arrow_reader/selection.rs#L339):

```
"selection contains less than the number of selected rows"
```

**To Reproduce**

- Two row selectors. The first one filters on `uint64` and the second one on `List<utf8>`.
- The value that it fails to read is the last entry in the second column chunk (with `List<utf8>` type).
- The first selector produces the following input selection for the second selector:
  ```
  RowSelection { selectors: [RowSelector { row_count: 16313, skip: true }, RowSelector { row_count: 3569, skip: false }, RowSelector { row_count: 48237, skip: true }, RowSelector { row_count: 6097, skip: false }, RowSelector { row_count: 25783, skip: true }, RowSelector { row_count: 1, skip: false }] }
  ```
- Row group size: 100000
- Page size: 1024

**Expected behavior**

Based on the input selector, it should read 9667 rows, but it only reads 9666 because the page that contains the last record is unintentionally skipped.
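As a rough sanity check of the numbers above, here is a minimal standalone sketch (standard library only). It does not use the parquet crate: `Selector`, `reported_selection`, `selected_rows`, and `records_started` are hypothetical names introduced just for illustration, and the repetition levels in `main` are made up. It shows that the reported selection covers the full row group and selects 9667 rows, and that in a list column the number of values can differ from the number of records, since a record starts only where the repetition level is 0.

```rust
/// Hypothetical stand-in for the `RowSelector` printed in the report
/// (not the parquet crate's type).
#[derive(Debug, Clone, Copy)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// The input selection copied from the report.
fn reported_selection() -> Vec<Selector> {
    vec![
        Selector { row_count: 16313, skip: true },
        Selector { row_count: 3569, skip: false },
        Selector { row_count: 48237, skip: true },
        Selector { row_count: 6097, skip: false },
        Selector { row_count: 25783, skip: true },
        Selector { row_count: 1, skip: false },
    ]
}

/// Number of rows the selection asks the reader to return.
fn selected_rows(selectors: &[Selector]) -> usize {
    selectors
        .iter()
        .filter(|s| !s.skip)
        .map(|s| s.row_count)
        .sum()
}

/// Number of records that start within a run of repetition levels.
/// In Parquet, a repetition level of 0 marks the first value of a new record.
fn records_started(rep_levels: &[i16]) -> usize {
    rep_levels.iter().filter(|&&level| level == 0).count()
}

fn main() {
    let selectors = reported_selection();
    let total: usize = selectors.iter().map(|s| s.row_count).sum();
    println!("rows covered:  {total}"); // 100000, the row group size
    println!("rows selected: {}", selected_rows(&selectors)); // 9667

    // Made-up levels for three list records: [a, b], [c], [d, e, f].
    let rep_levels = [0, 1, 0, 0, 1, 1];
    println!(
        "{} values, {} records",
        rep_levels.len(),
        records_started(&rep_levels)
    ); // 6 values, 3 records
}
```

The last selector selects a single row at the very end of the row group, so skipping one page too many is enough to lose exactly that row.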
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
