RatulDawar opened a new pull request, #22857:
URL: https://github.com/apache/datafusion/pull/22857

   ## Which issue does this PR close?
   
   - Closes #22795
   
   ## Rationale for this change
   
   The Parquet opener loaded the page index (ColumnIndex + OffsetIndex) 
before row-group statistics pruning. When every surviving row group is fully 
matched by row-group statistics (for example, `IS NOT NULL` on a column with 
no nulls), the page index cannot prune anything further, so the I/O spent 
loading it is wasted.
   
   ## What changes are included in this PR?
   
   - Reorder the opener state machine to `PrepareFilters → PruneWithStatistics → LoadPageIndex? → LoadBloomFilters`, where `?` marks the step that is now conditional
   - Skip `load_page_index` when there is no page-pruning predicate, no 
surviving row groups, or every surviving row group is fully matched
   - Add unit and integration tests for the gate and the fully-matched 
`IS NOT NULL` case
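
The skip conditions in the list above can be sketched as a small predicate. This is a minimal illustration, not the code in this PR: the `RowGroupOutcome` enum and the `should_load_page_index` function are hypothetical names chosen for the example, and the real opener tracks this state differently.

```rust
/// Hypothetical outcome of row-group statistics pruning for one row group.
/// (Illustrative only; not a DataFusion type.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RowGroupOutcome {
    /// Statistics prove no row can match; the row group is skipped.
    Pruned,
    /// Statistics prove every row matches; page pruning cannot help.
    FullyMatched,
    /// Statistics are inconclusive; page-level pruning may still help.
    MaybeMatches,
}

/// Returns true only when loading the page index could still prune pages.
///
/// Loading is skipped when there is no page-pruning predicate, when no
/// row group survived, or when every surviving row group is fully matched.
fn should_load_page_index(
    has_page_pruning_predicate: bool,
    outcomes: &[RowGroupOutcome],
) -> bool {
    if !has_page_pruning_predicate {
        return false;
    }
    // Only an inconclusive row group can benefit from page-level pruning.
    // An empty slice, or one holding only Pruned / FullyMatched entries,
    // yields false here.
    outcomes
        .iter()
        .any(|outcome| *outcome == RowGroupOutcome::MaybeMatches)
}
```

Under these assumptions, a file whose row groups are all `FullyMatched` (the `IS NOT NULL` case) returns `false`, so the page index load is skipped, while a single `MaybeMatches` row group is enough to keep the load.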
   
   ## Are these changes tested?
   
   - `cargo test -p datafusion-datasource-parquet should_load`
   - `cargo test -p datafusion-datasource-parquet page_index_skip`
   - `cargo test -p datafusion-datasource-parquet opener::test::test_page_pruning`
   - `cargo test -p datafusion --test parquet_integration`
   - `cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings`
   
   ## Are there any user-facing changes?
   
   No user-facing API changes. This reduces unnecessary Parquet page index I/O 
during scan planning when row-group statistics already prove no further pruning 
is possible.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

