adriangb opened a new pull request, #22498:
URL: https://github.com/apache/datafusion/pull/22498

   ## Which issue does this PR close?
   
   - Supersedes #22235 (same goal — per-conjunct pruning stats for adaptive 
filter scheduling — with a different, more reusable API). Please close #22235 
in favor of this if the direction lands.
   - Part of #22144 (Adaptive filter pushdown).
   
   ## Rationale for this change
   
   #22235 surfaced per-conjunct pruning effectiveness by bolting a second, 
AND-only code path onto `PruningPredicate` (a marker wrapper holding 
`sub_predicates`, a placeholder `true` literal as its `predicate_expr`, and a 
`prune_per_conjunct` that re-walks the leaves a second time). Review feedback 
(thanks @asolimando) flagged three things, all of which trace back to that 
shape:
   
   1. Double traversal — `prune` walks the leaves, then `prune_per_conjunct` 
walks them again.
   2. Duplicated page-pruning method (`prune_plan_with_per_conjunct_stats`) 
that is a near-copy of the untagged path.
   3. The wrapper's `predicate_expr()` / `literal_guarantees()` / 
`required_columns()` silently return placeholder values, which is misuse-prone.
   
   Plus two open design questions: AND-only (no OR/NOT), and order-dependent 
stats from the page-index short-circuit (later conjuncts get biased 
`rows_seen`).
   
   This PR reshapes the feature around a small **boolean expression tree** with 
**observer-driven evaluation**, which resolves all of the above and is reusable 
for row-group pruning, page pruning, and (sketched) the RowFilter decode path.
   
   ## What changes are included in this PR?
   
   Stacked into four reviewable commits, each builds green:
   
   1. **`PruningPredicateTree`** (`datafusion-pruning`) — composes existing 
`PruningPredicate` leaves under `And` / `Or` nodes. Evaluation walks the tree 
once; `AND` short-circuits on all-false, `OR` on all-true. A 
`PruningObserver::on_leaf(tag, mask)` fires **only for leaves actually 
evaluated**, so stats are never biased by short-circuited predicates. 
`NoopObserver` makes the untagged path zero-overhead. `NOT` is intentionally 
not a tree node (bitwise-negating a conservative pruning mask is unsound) — 
callers push `NOT` to the leaf via `PruningPredicate::try_new`.
   
   2. **Tagged constructor + `ConjunctStatsObserver`** — 
`PruningPredicateTree::try_new_tagged_conjuncts(&[(Tag, expr)], &schema)` 
builds an `And` of tagged leaves; `ConjunctStatsObserver` accumulates 
`PerConjunctPruneStats { tag, containers_seen, containers_pruned }`. No 
placeholder expressions, no second field on `PruningPredicate`.
   
   3. **Parquet row-group + page filters** consume the tree via 
`*_with_observer` methods. The page-index loop now has one shared body (no 
duplicated method); the `!selects_any` short-circuit fires after a leaf was 
observed, so unevaluated leaves are correctly absent from per-conjunct stats.
   
   4. **RowFilter hook (sketch)** — `build_row_filter_tagged` + a per-predicate 
`Tag`/`RowFilterObserver` on `DatafusionArrowPredicate`. Emits *conditional* 
selectivity (rows reaching each predicate after prior ones filtered) — what an 
adaptive scheduler actually wants. The module docs describe a follow-up 
`CompoundArrowPredicate` that would own a sub-tree and do datafusion-side 
OR/NOT short-circuit during decode.
   
   ### How this maps to the review on #22235
   | Concern | Resolution |
   |---|---|
   | Double traversal | Single observer-driven walk |
   | Duplicated page method | One shared body, observer param |
   | Placeholder `predicate_expr` | Gone — honest tree, leaves are real 
`PruningPredicate`s |
   | AND-only (Q1) | `Or` supported now; `Not` pushed to leaves |
   | Order-dependent stats (Q2) | Short-circuited leaves are never observed |
   
   ## Are these changes tested?
   
   Yes — unit tests for the tree (leaf equivalence, AND/OR short-circuit 
observation, tagged stats accumulation, always-true dropping) and a row-group 
filter test asserting the observer surfaces correct per-conjunct stats while 
preserving the existing pruning result. All existing 
`datafusion-datasource-parquet` lib tests pass unchanged.
   
   ## Are there any user-facing changes?
   
   New public API in `datafusion-pruning` (`PruningPredicateTree`, 
`PruningObserver`, `ConjunctStatsObserver`, `PerConjunctPruneStats`, `Tag`) and 
`datafusion-datasource-parquet` (`build_row_filter_tagged`, 
`RowFilterObserver`, observer-accepting prune methods). Purely additive; 
existing `prune` / `prune_by_statistics` / `prune_plan_with_page_index` paths 
are unchanged.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to