mbutrovich opened a new issue, #22795:
URL: https://github.com/apache/datafusion/issues/22795

   ## Is your feature request related to a problem or challenge?
   
   The Parquet opener loads the page index (ColumnIndex plus OffsetIndex) for 
any file whose scan has a page-pruning predicate, before it knows whether the 
page index can prune anything. For predicates that row-group statistics already 
resolve, this is pure I/O and parsing overhead that prunes zero pages.
   
   The clearest case is `IS NOT NULL` on a column that has no nulls. In 
`datafusion/pruning`, `IS NOT NULL` pruning rewrites to `null_count != 
row_count`, so a container is pruned only when it is entirely null. On a 
non-null column no page is ever all-null, so the page index is loaded and 
prunes nothing. On a wide fact table scanned with `IS NOT NULL` filters on 
non-null join keys, this adds roughly 280 KB of page index per file. Across 
tens of thousands of files that is gigabytes of wasted reads.
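   
   The rewrite above can be sketched in a few lines. This is an illustration only: `ContainerStats` and `is_not_null_may_match` are hypothetical names, not DataFusion types or functions, and a "container" stands for either a row group or a page.
   
   ```rust
   /// Statistics for one column within one container (a row group or a page).
   /// Hypothetical type for illustration; not a DataFusion struct.
   #[derive(Debug, Clone, Copy)]
   pub struct ContainerStats {
       pub null_count: u64,
       pub row_count: u64,
   }
   
   /// Evaluates the rewritten `IS NOT NULL` predicate, `null_count != row_count`.
   /// `true`: the container may hold a non-null row and must be kept.
   /// `false`: every row is null, so the container can be pruned.
   pub fn is_not_null_may_match(stats: ContainerStats) -> bool {
       stats.null_count != stats.row_count
   }
   ```
   
   For a page of a non-null column, `null_count` is 0 and `row_count` is positive, so the function returns `true` for every page and nothing is pruned.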
   
   This surfaced downstream in DataFusion Comet (apache/datafusion-comet#3978): 
a TPC-DS q88 scan loads about 2.8 GB of page index for `IS NOT NULL` filters on 
non-null foreign keys, pruning nothing.
   
   ## Describe the solution you'd like
   
   Gate the page index load on whether row-group statistics leave any work for 
it to do.
   
   Row-group pruning sorts each row group into one of three buckets:
   
   1. **Pruned**: RG statistics prove no row matches. The whole row group is 
dropped and the page index is irrelevant.
   2. **Fully matched**: RG statistics prove every row matches. The page index 
cannot prune anything (justified below).
   3. **Inconclusive**: RG statistics prove neither. Some rows might match and 
some might not.
   
   The page index can only prune in bucket 3. Page-index pruning removes a page 
if and only if the predicate is provably false for every row on that page, and 
a page holds a subset of the row group's rows. In bucket 2 the predicate is 
provably true for every row in the row group, and therefore for every row on 
every page. No page can be all-non-matching, so no page is prunable and there 
is nothing left to refine. In bucket 3 the possibly-non-matching rows may be 
concentrated on a few pages that the page index can isolate, so the page index 
can still refine the scan and must be loaded.
   
   So the rule is: **skip the page index load only when every surviving row 
group is in bucket 2 (fully matched). A single bucket 3 row group forces the 
load.** Note that "row group could not be pruned" is the wrong condition, 
because it merges buckets 2 and 3.
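   
   A minimal sketch of this rule, assuming the per-row-group outcome is already available as a three-valued enum. `RowGroupOutcome` and `should_load_page_index` are hypothetical names used for illustration, not existing DataFusion items.
   
   ```rust
   /// Result of row-group statistics pruning for one row group.
   /// Hypothetical enum mirroring the three buckets described above.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   pub enum RowGroupOutcome {
       /// Bucket 1: statistics prove no row matches.
       Pruned,
       /// Bucket 2: statistics prove every row matches.
       FullyMatched,
       /// Bucket 3: statistics prove neither.
       Inconclusive,
   }
   
   /// Returns `true` when at least one row group is inconclusive, which is the
   /// only case in which the page index can prune anything.
   pub fn should_load_page_index(outcomes: &[RowGroupOutcome]) -> bool {
       outcomes
           .iter()
           .any(|outcome| *outcome == RowGroupOutcome::Inconclusive)
   }
   ```
   
   Testing for `Inconclusive` directly, rather than for "not pruned", is what keeps buckets 2 and 3 apart. A file whose row groups are all pruned or fully matched returns `false`.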
   
   DataFusion already computes the relevant signal. PR #21637 added "fully 
matched" detection and uses it to skip page-index pruning work for 
fully-matched row groups. For `IS NOT NULL`, a row group with `null_count == 0` 
is fully matched.
   
   The gap is ordering. The opener state machine 
(`datafusion/datasource-parquet/src/opener/mod.rs`) runs:
   
   ```
   LoadMetadata (footer, PageIndexPolicy::Skip)
     -> PrepareFilters
     -> LoadPageIndex            // page index I/O happens here
     -> PruneWithStatistics      // row-group stats pruning / fully-matched decided here
     -> ...
   ```
   
   `LoadPageIndex` runs before `PruneWithStatistics`, so the fully-matched 
determination that would prove the page index useless happens after the bytes 
are already fetched. The existing optimization saves CPU (skips page-index 
pruning work) but not I/O.
   
   Proposed change: make the fully-matched determination available before the 
page index load, and skip `load_page_index` when every surviving row group is 
fully matched by the page-pruning predicate using row-group statistics alone. 
Row-group statistics are present in the footer already loaded under 
`PageIndexPolicy::Skip`, so no extra I/O is required to make this decision.
   
   Concretely for the `IS NOT NULL` case: skip the load when, for every 
referenced column, the row-group statistics report `null_count == Some(0)`.
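   
   As a rough sketch of that condition, assume the footer's null counts have been collected into one `Vec<Option<u64>>` per surviving row group, with one entry per column referenced by the predicate and `None` where the count is absent. That layout and the function name are assumptions made for this example; the real opener reads these values from the Parquet metadata.
   
   ```rust
   /// Decides whether the page index load can be skipped for an `IS NOT NULL`
   /// predicate, using row-group statistics only.
   ///
   /// `surviving_row_groups[i][j]` is the null count of the j-th referenced
   /// column in the i-th surviving row group, or `None` if it is not recorded.
   /// (Hypothetical input layout, for illustration.)
   pub fn can_skip_page_index_for_is_not_null(
       surviving_row_groups: &[Vec<Option<u64>>],
   ) -> bool {
       surviving_row_groups
           .iter()
           // A missing null count compares unequal to `Some(0)`, so it
           // conservatively forces the load.
           .all(|columns| columns.iter().all(|null_count| *null_count == Some(0)))
   }
   ```
   
   A single positive or missing null count in any surviving row group makes the function return `false`, so the page index is loaded as it is today.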
   
   ## Describe alternatives you've considered
   
   - Classify the page-pruning predicate by which statistics it uses 
(`StatisticsType` in the pruning predicate's `RequiredColumns`) and skip the 
load when it references only `NullCount` / `RowCount` and never `Min` / `Max`. 
This is narrower than the fully-matched approach and still needs the row-group 
null-count gate, so the fully-matched route is preferred because it already 
exists and covers more predicate shapes.
   
   - Cache the full metadata including the page index so repeated opens of the 
same file pay the load only once. This helps when the page index is actually 
useful but does not help the non-selective case, where the cheapest fix is to 
not load it at all.
   
   ## Additional context
   
   Correctness notes for the gate:
   
   - **Fully matched must be null-aware.** For a predicate that rejects nulls, 
such as `x > 50`, fully matched requires `min_value > 50` and `null_count == 
0`. If the null count is positive, an all-null page would be pruned by `x > 
50`, so the page index still has value and the load must not be skipped. The 
gate is only as correct as the underlying fully-matched computation's null 
handling, so it must depend on the null-aware definition. This should be 
verified in the #21637 logic before relying on it.
   
   - **Missing statistics fall back to loading.** `Statistics.null_count` is 
`optional` in the Parquet thrift spec, and a column chunk may carry no 
`Statistics` at all. Treat a missing `null_count` (or missing statistics) as 
"not provably zero" and load the page index. The `IS NOT NULL` skip condition 
is therefore "statistics present and `null_count == Some(0)` for all referenced 
columns," conservatively false otherwise. Modern writers emit row-group 
`null_count` in practice, so the common case still benefits.
   
   - **The fully-matched determination must use row-group statistics only**, 
never the page index, since the whole point is to decide whether to load the 
page index.
   
   - **The change is a reorder of the opener state machine** so that 
row-group-stats pruning / fully-matched runs before the page index load. The 
staged structs (`FiltersPreparedParquetOpen`, `RowGroupsPrunedParquetOpen`, and 
related) need rewiring, and the bloom-filter stage should be checked for any 
dependence on the current ordering.
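   
   The first two notes can be illustrated for the `x > 50` example with a small sketch. `Int64ChunkStats` and `fully_matched_gt` are hypothetical names for this example only. They do not reflect how #21637 computes fully-matched row groups, which still needs to be verified as noted above.
   
   ```rust
   /// Row-group statistics for one INT64 column chunk, with each field optional
   /// as in the Parquet format. Hypothetical type for illustration.
   #[derive(Debug, Clone, Copy)]
   pub struct Int64ChunkStats {
       pub min: Option<i64>,
       pub null_count: Option<u64>,
   }
   
   /// Null-aware "fully matched" check for the predicate `x > threshold`.
   /// Returns `true` only when statistics are present, `null_count` is exactly
   /// zero, and the minimum exceeds the threshold. Anything missing is treated
   /// as "not provable", so the caller falls back to loading the page index.
   pub fn fully_matched_gt(stats: Option<Int64ChunkStats>, threshold: i64) -> bool {
       match stats {
           Some(Int64ChunkStats {
               min: Some(min),
               null_count: Some(0),
           }) => min > threshold,
           _ => false,
       }
   }
   ```
   
   With `min = 60` and `null_count = 5` the function returns `false`: an all-null page could still be pruned by `x > 50`, so the page index keeps its value for that row group.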
   
   Relevant code:
   
   - Opener state machine and stages: 
`datafusion/datasource-parquet/src/opener/mod.rs`
   - Page index load helper (the `missing_column_index || missing_offset_index` 
guard): `load_page_index` in the same file
   - Fully-matched page pruning: `PagePruningAccessPlanFilter` in 
`datafusion/datasource-parquet/src/page_filter.rs`
   - `IS NOT NULL` rewrite to `null_count != row_count`: 
`datafusion/pruning/src/pruning_predicate.rs`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

