rdblue commented on PR #6935: URL: https://github.com/apache/iceberg/pull/6935#issuecomment-1445436703
> The readFilteredRowGroup method provided by Parquet will detect whether there is a filter pushed down, and only return the filtered row-group when there is a push-down filter.

I commented where we set the row ranges for the row group. I think that should work with Parquet, but it's been a while since I looked at it. Getting a public API call in would make it easier.

> I think RowRanges is also at the row [group level](https://github.com/apache/parquet-mr/blob/c9cfe821448a2f99797fda7f46c70a16cc1250a9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java#L33), Parquet-mr will [generate](https://github.com/apache/parquet-mr/blob/c9cfe821448a2f99797fda7f46c70a16cc1250a9/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1142) a RowRanges for each row-group when running the column index filter.

Great! I put this together by trying to reverse engineer what was going on in Parquet, so I must have gotten it right.

> We will have to wait for the next version to use, but Parquet-mr may have a release in a month, see this comment, we should be able to catch up with this release. If you agree, I can open a PR in the Parquet-mr repo.

I'm all for adding what we need to Parquet. We can continue to use reflection until it is available.

I think the main thing is that we don't currently handle the row ranges after we've skipped reading the pages. So the next steps are to verify what's in this PR and then to update the read paths so that values are skipped for the skipped rows.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
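The last point in the comment, skipping values for rows that fall outside the row ranges, can be illustrated with a small language-neutral sketch in Python. It is not Iceberg or Parquet code. The function name `plan_steps` and the representation of row ranges as inclusive `(first_row, last_row)` tuples relative to the start of a row group are assumptions made for this example only.

```python
from typing import List, Tuple


def plan_steps(
    row_ranges: List[Tuple[int, int]], row_count: int
) -> List[Tuple[str, int]]:
    """Turn row ranges for one row group into alternating skip/read steps.

    Assumption: each range is (first_row, last_row), inclusive, with indexes
    relative to the start of the row group. This is a hypothetical
    representation, not the Parquet RowRanges API.
    """
    steps: List[Tuple[str, int]] = []
    position = 0  # next row index that has not been planned yet
    for first, last in sorted(row_ranges):
        # Reject overlapping, inverted, or out-of-bounds ranges.
        if first < position or last < first or last >= row_count:
            raise ValueError(f"invalid row range: ({first}, {last})")
        if first > position:
            # Rows between the previous range and this one must be skipped.
            steps.append(("skip", first - position))
        steps.append(("read", last - first + 1))
        position = last + 1
    if position < row_count:
        # Rows after the last range are skipped as well.
        steps.append(("skip", row_count - position))
    return steps


if __name__ == "__main__":
    # A 12-row row group where the filter keeps rows 2..4 and 8..9.
    for action, count in plan_steps([(2, 4), (8, 9)], 12):
        print(action, count)
```

A reader following such a plan would consume values only in the `read` steps and advance past values in the `skip` steps. How this maps onto pages that were already dropped by the column index filter is left out of the sketch.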
