Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

via GitHub Wed, 15 Apr 2026 10:50:52 -0700


asolimando commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3088387725



##########
datafusion/core/src/physical_planner.rs:
##########
@@ -2898,7 +2920,11 @@ impl DefaultPhysicalPlanner {
                     .into_iter()
                     .map(|(expr, alias)| ProjectionExpr { expr, alias })
                     .collect();
-                Ok(Arc::new(ProjectionExec::try_new(proj_exprs, input_exec)?))
+                let mut proj_exec = ProjectionExec::try_new(proj_exprs, 
input_exec)?;

Review Comment:
   @kosiew, thanks for your patience, here's what changed since your last 
review.
   
   ### Addressing the plan-shape dependency:
   
   To address this I finally included a re-injection loop in the physical 
planner that runs after each optimizer rule that modifies the plan (detected 
via `Arc::ptr_eq`). This ensures optimizer-created nodes (e.g. `FilterExec` 
from filter pushdown into unions, repositioned filters through joins, etc.) 
always receive the registry. The walk is entirely skipped when the 
ExpressionAnalyzer is disabled, no additional overhead for the common path.
   
   Caveat: nodes created within a rule (e.g. a new FilterExec) receive the 
registry only _after_ that rule completes, before the next rule runs. No such 
rule (`EnforceDistribution`, `CombinePartialFinalAggregate`, 
`ProjectionPushdown`, `AggregateStatistics` and `JoinSelection`) reads stats 
from the nodes it just created within the same rule execution, so there's no 
gap at the moment.
   
   Since there is no gap and that this concern will go away if we integrate 
fully with `partitions_statistics` (see the "Future direction" section), which 
is the natural evolution of this proposal, I consider the integration 
satisfactory, what do you think?
   
   ### Integration with StatisticsRegistry (#21483):
   The `AggregateStatisticsProvider` and `JoinStatisticsProvider` providers 
consume `ExpressionAnalyzer` and this adds support for aggregations and joins. 
When no registry is injected, they delegate to the existing code path, purely 
additive as the rest.
   
   ### Future direction: removing the injection overhead (just FYI, no action 
for this PR)
   The re-injection loop for `ExpressionAnalyzer` and the separate 
`StatisticsRegistry` tree walk both exist because `partition_statistics` has no 
way to receive external context and we don't want breaking changes.
   But we could piggyback on #20184 (`child_stats` in `partition_statistics`) 
as it plans to update `partition_statistics`'s signature anyway.
   
   By generalizing slightly with a `StatisticsContext` parameter that carries 
both the `ExpressionAnalyzerRegistry` and `StatisticsRegistry` providers, we 
would get headroom for future changes while limiting breaking changes.
   
   This would collapse both mechanisms into the natural `partition_statistics` 
recursion, eliminating the injection walks and the separate bottom-up traversal 
entirely while keeping the config flags to turn either framework on or off.
   
   The separate walks are acceptable now as a trade-off for a purely additive 
and non-breaking change, but this would be the natural evolution of the 
proposal.
   
   ### New commits in detail:
   I had to rebase on #21483, this will make incremental reviewing harder for 
you, so let me walk you through the changes since your last review (commits 
9-12 of 12):
   
   9. Inject ExpressionAnalyzerRegistry into exec nodes and re-inject via 
optimizer loop - the core change addressing your feedback, includes integration 
tests
   10. test(filter): verify ExpressionAnalyzer inclusion-exclusion selectivity 
for OR predicates - unit test for OR selectivity in FilterExec
   11. refactor(expression_analyzer): delegate when no NDV statistics are 
available - I realized we were trying to do too much when some columns 
statistics are unset, now we delegate to built-in
   12. test(expression_analyzer): add StatisticsTable and end-to-end SLT for OR 
selectivity - end-to-end SLT tests verifying the registry survives TopK sort, 
UNION ALL filter pushdown, and hash join filter pushdown (queries chosen as 
they generate nodes that would need the `ExpressionAnalyzer`, exactly the gap 
we had before)
   
   Looking forward to your feedback!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

Reply via email to