rdblue opened a new pull request, #6935: URL: https://github.com/apache/iceberg/pull/6935
This adds an evaluator, `ParquetIndexPageFilter.EvalVisitor` to evaluate an Iceberg Expression against Parquet's page indexes. That produces Parquet's `RowRanges`, which track ranges of rows that should be read from the Parquet file. The filter class, `ParquetIndexPageFilter` also sets the row ranges on the `ParquetFileReader`. Parquet doesn't expose some of the `RowRanges` methods publicly, so this has a `ParquetRanges` class that uses reflection to construct the ranges and to use them to filter `ParquetFileReader`. Also, **this does not update record materialization**. Record materialization will need to coordinate the values read across pages. Currently, it assumes that the pages returned are in sync, but that will change when Parquet filters pages. For example, if column A has just one page, `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]` and column B has two pages, `[a, b, c, d, e]` and `[f, g, h, i, j, k]`, and the row range is `6-8` then the results should be `(6, g)`, `(7, h)`, and `(8, i)`. The reader for column A will get the entire page and need to skip 6 values (0, 1, 2, 3, 4, and 5), then the reader for column B will get just the second page and need to skip 1 value (f) to get in sync. The readers will need to handle this. I did this work a while ago and after looking at it with fresh eyes today, I have a couple concerns: 1. I didn't get to writing tests for the `RowRanges` that are produced. This should be done next. 2. The filter is applied at the row group level, but it looks like that's not how `RowRanges` works in Parquet, so this probably needs to be refactored to run the filter, produce `RowRanges`, and set them on the `ParquetFileReader` just once. In addition, if any row groups are completely eliminated by the filter, this will need to account for the missing row groups in the Iceberg Parquet reader. 3. Since the filter is applied at the row group level, `allRows` may be incorrect (it is the row group's row count) and need to be adjusted to be the entire file rather than specific to a row group. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
