RatulDawar opened a new pull request, #22857: URL: https://github.com/apache/datafusion/pull/22857
## Which issue does this PR close?

- Closes #22795

## Rationale for this change

The Parquet opener was loading the page index (ColumnIndex + OffsetIndex) before row-group statistics pruning. When all surviving row groups are fully matched by row-group statistics (for example, `IS NOT NULL` on a non-null column), page index I/O cannot prune further and is wasted.

## What changes are included in this PR?

- Reorder the opener state machine: `PrepareFilters → PruneWithStatistics → LoadPageIndex? → LoadBloomFilters`
- Skip `load_page_index` when there is no page-pruning predicate, no surviving row groups, or every surviving row group is fully matched
- Add unit and integration tests for the gate and the fully-matched `IS NOT NULL` case

## Are these changes tested?

- `cargo test -p datafusion-datasource-parquet should_load`
- `cargo test -p datafusion-datasource-parquet page_index_skip`
- `cargo test -p datafusion-datasource-parquet opener::test::test_page_pruning`
- `cargo test -p datafusion --test parquet_integration`
- `cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`

## Are there any user-facing changes?

No user-facing API changes. This reduces unnecessary Parquet page index I/O during scan planning when row-group statistics already prove no further pruning is possible.

Made with [Cursor](https://cursor.com)
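
For reviewers, the skip condition described under "What changes are included in this PR?" can be sketched roughly as below. This is an illustration only, not the code in this PR: `RowGroupOutcome` and `should_load_page_index` are hypothetical names, and the real opener tracks row-group state with its own types.

```rust
/// Hypothetical outcome of row-group statistics pruning for one row group.
/// (Illustrative type; not taken from the DataFusion code base.)
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowGroupOutcome {
    /// Statistics prove that no row can match; the row group is skipped.
    Pruned,
    /// Statistics cannot decide; some rows may match, some may not.
    MayMatch,
    /// Statistics prove that every row matches the predicate.
    FullyMatched,
}

/// Hypothetical gate: returns `true` only if loading the page index could
/// still prune something.
///
/// Loading is skipped when:
/// - there is no page-pruning predicate,
/// - no row group survived statistics pruning, or
/// - every surviving row group is already fully matched.
#[allow(dead_code)]
fn should_load_page_index(
    has_page_pruning_predicate: bool,
    outcomes: &[RowGroupOutcome],
) -> bool {
    if !has_page_pruning_predicate {
        return false;
    }
    // Page-level pruning can only help for a surviving row group whose
    // match status is still undecided. An empty or all-pruned slice and
    // an all-fully-matched set of survivors both yield `false` here.
    outcomes
        .iter()
        .any(|outcome| matches!(outcome, RowGroupOutcome::MayMatch))
}
```

Under these assumptions, a scan with `IS NOT NULL` on a non-null column would see every surviving row group as `FullyMatched`, so the gate returns `false` and the page index is never read.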
