hhhizzz commented on code in PR #8733:
URL: https://github.com/apache/arrow-rs/pull/8733#discussion_r2480294162


##########
parquet/src/column/reader.rs:
##########
@@ -214,6 +219,49 @@ where
             let remaining_records = max_records - total_records_read;
             let remaining_levels = self.num_buffered_values - self.num_decoded_values;
 
+            if self.synthetic_page {

Review Comment:
   Thanks for the thorough code review!
   Yes, this is the trickiest part of the PR. When no pages are skipped, 
everything works as expected. However, some pages can be skipped during 
row-group construction, in which case the column chunk is held as a Sparse 
`ColumnChunkData` and the values and definition/repetition levels of those 
pages are never read. Row selection still works because `skip_records()` 
handles this case and skips the page accordingly.
   
   However, with the Boolean-array design, all values must be read and decoded 
before filtering. `ParquetRecordBatchReader` is a streaming reader; it has no 
concept of pages, so we can’t rely on page size to drive skipping there. I 
think the most practical approach, therefore, is to return dummy null values as 
placeholders for the skipped pages. If I missed something or there's a better 
way to do this, just let me know. 😊
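   
   To make the placeholder idea concrete, here is a minimal, self-contained 
sketch. It is not the code in this PR: `pad_skipped_pages` and `apply_mask` are 
hypothetical helpers, pages are modelled as plain `Option<Vec<i32>>`, and every 
page is assumed to hold the same number of rows. The real reader works on 
definition/repetition levels rather than on decoded `Option` values.
   
   ```rust
   /// Hypothetical model of a sparse column chunk: `None` is a page that was
   /// never fetched, `Some(values)` is a page that was read and decoded.
   fn pad_skipped_pages(pages: &[Option<Vec<i32>>], rows_per_page: usize) -> Vec<Option<i32>> {
       let mut out = Vec::with_capacity(pages.len() * rows_per_page);
       for page in pages {
           match page {
               // A page that was read contributes its real values.
               Some(values) => out.extend(values.iter().copied().map(Some)),
               // A skipped page contributes one null placeholder per row, so
               // row positions still line up with the boolean mask.
               None => out.extend(std::iter::repeat(None::<i32>).take(rows_per_page)),
           }
       }
       out
   }
   
   /// Keep only the rows whose mask bit is set.
   fn apply_mask(values: &[Option<i32>], mask: &[bool]) -> Vec<Option<i32>> {
       values
           .iter()
           .zip(mask)
           .filter(|(_, keep)| **keep)
           .map(|(value, _)| *value)
           .collect()
   }
   ```
   
   Because the mask never selects a row from a skipped page, the placeholders 
are always filtered out and never reach the caller.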
   
   ## A simple example:
   
   The page size is 2 and the mask is 100001, so the row selection is read(1) 
skip(4) read(1).
   The `ColumnChunkData` would be page1(10), page2(skipped), page3(01).
   With the `RowSelection`, skip(4) means page2 is never read at all.
   With the bit mask, however, all 6 values must be read, but page2 is not in 
memory, which is why I need to construct this synthetic page.
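   
   The example can be checked with a small sketch. The `Run` enum and both 
functions below are hypothetical names for illustration, not the parquet 
crate's `RowSelection` API, and a fixed number of rows per page is assumed.
   
   ```rust
   /// Hypothetical run-length form of a selection: read or skip `n` rows.
   #[derive(Debug, PartialEq, Clone, Copy)]
   enum Run {
       Read(usize),
       Skip(usize),
   }
   
   /// Collapse a boolean mask into alternating read/skip runs.
   fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
       let mut runs = Vec::new();
       let mut i = 0;
       while i < mask.len() {
           let keep = mask[i];
           let start = i;
           // Advance to the end of the current run of identical bits.
           while i < mask.len() && mask[i] == keep {
               i += 1;
           }
           let len = i - start;
           runs.push(if keep { Run::Read(len) } else { Run::Skip(len) });
       }
       runs
   }
   
   /// For each page, report whether the mask selects at least one of its rows.
   /// Pages reported as `false` are the ones that can be left unread.
   fn pages_with_selected_rows(mask: &[bool], rows_per_page: usize) -> Vec<bool> {
       assert!(rows_per_page > 0, "rows_per_page must be non-zero");
       mask.chunks(rows_per_page)
           .map(|page| page.iter().any(|&keep| keep))
           .collect()
   }
   ```
   
   For the mask 100001 with 2 rows per page, `mask_to_runs` gives 
`[Read(1), Skip(4), Read(1)]` and `pages_with_selected_rows` gives 
`[true, false, true]`, matching page1(10), page2(skipped), page3(01).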
   
   ---
   For completeness, I prototyped reworking the readers to handle skipped 
pages directly, but that introduces a breaking change: every array reader would 
need a page-size parameter. That’s undesirable, since users shouldn’t need 
page-level details just to read Parquet.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
