marekgalovic opened a new issue, #4921:
URL: https://github.com/apache/arrow-rs/issues/4921

   **Describe the bug**
   Loading the page index in order to prune pages breaks page skipping for 
columns with nested types in 
[GenericColumnReader::skip_records](https://github.com/pinecone-io/arrow-rs/blob/pinecone-main/parquet/src/column/reader.rs#L331).
 This method assumes that, when the page metadata has `num_records` set, the 
value equals the number of records that should be skipped. That does not appear 
to hold for columns with list-typed values, because the value does not account 
for repetition levels. Additionally, 
[SerializedPageReader::peek_next_page](https://github.com/pinecone-io/arrow-rs/blob/pinecone-main/parquet/src/file/serialized_reader.rs#L735)
 sets `num_rows` to a value derived from `PageLocation`, which does not consider 
repetition levels for the column either.
   
   As a result, `GenericColumnReader::skip_records` skips over more pages than 
it should, which causes the next read to fail with the [following 
error](https://github.com/pinecone-io/arrow-rs/blob/pinecone-main/parquet/src/arrow/arrow_reader/selection.rs#L339):
   ```
   "selection contains less than the number of selected rows"
   ```
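
To illustrate why a per-page value count and a per-page record count can differ for list columns, here is a minimal standalone sketch. It does not use the `parquet` crate; the function names `records_started` and `levels_for_records` are hypothetical and only model the general Parquet rule that a repetition level of 0 marks the first value of a new record.

```rust
/// Number of records that start in a page, given the repetition levels of the
/// values stored in it. A repetition level of 0 marks the first value of a new
/// record; any higher level continues the current record.
fn records_started(rep_levels: &[i16]) -> usize {
    rep_levels.iter().filter(|&&level| level == 0).count()
}

/// Hypothetical helper: how many leading levels must be consumed to skip `n`
/// whole records. Returns the index of the level at which record `n`
/// (zero-based) begins, or `rep_levels.len()` if the buffer ends first.
/// In the latter case the last record may continue in the next page.
fn levels_for_records(rep_levels: &[i16], n: usize) -> usize {
    let mut seen = 0;
    for (i, &level) in rep_levels.iter().enumerate() {
        if level == 0 {
            if seen == n {
                return i;
            }
            seen += 1;
        }
    }
    rep_levels.len()
}
```

For the three rows `[["a", "b"], ["c"], ["d", "e", "f"]]` the repetition levels are `[0, 1, 0, 0, 1, 1]`: six values but only three records, so skipping two records means consuming three levels, not two.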
   
   **To Reproduce**
   - Two row selectors: the first filters on a `uint64` column and the second 
on a `List<utf8>` column.
   - The value that fails to be read is the last entry in the second column 
chunk (with `List<utf8>` type).
   - The first selector produces the following input selection for the second 
selector:
   ```
   RowSelection { selectors: [
       RowSelector { row_count: 16313, skip: true },
       RowSelector { row_count: 3569, skip: false },
       RowSelector { row_count: 48237, skip: true },
       RowSelector { row_count: 6097, skip: false },
       RowSelector { row_count: 25783, skip: true },
       RowSelector { row_count: 1, skip: false },
   ] }
   ```
   - Row group size: 100000
   - Page size: 1024
   
   **Expected behavior**
   Based on the input selection, the reader should return 9667 rows 
(3569 + 6097 + 1), but it only returns 9666 because the page that contains the 
last record gets unintentionally skipped.
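
The expected row count can be checked with a small standalone sketch. The `Selector` struct below is a hypothetical stand-in that mirrors the fields printed in the selection above; it is not the `parquet` crate's `RowSelector` type.

```rust
/// Hypothetical stand-in for one entry of the selection shown above.
#[derive(Debug, Clone, Copy)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// The input selection from the reproduction steps.
const SELECTION: [Selector; 6] = [
    Selector { row_count: 16313, skip: true },
    Selector { row_count: 3569, skip: false },
    Selector { row_count: 48237, skip: true },
    Selector { row_count: 6097, skip: false },
    Selector { row_count: 25783, skip: true },
    Selector { row_count: 1, skip: false },
];

/// Rows the reader is expected to return: the sum of all non-skip runs.
fn selected_rows(selection: &[Selector]) -> usize {
    selection
        .iter()
        .filter(|s| !s.skip)
        .map(|s| s.row_count)
        .sum()
}

/// Rows covered by the selection overall, skipped or not.
fn total_rows(selection: &[Selector]) -> usize {
    selection.iter().map(|s| s.row_count).sum()
}
```

Here `selected_rows(&SELECTION)` is 9667 and `total_rows(&SELECTION)` is 100000, which matches the row group size, so the final selected row is the last row of the row group.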
   

