asolimando commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3088387725
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -2898,7 +2920,11 @@ impl DefaultPhysicalPlanner {
.into_iter()
.map(|(expr, alias)| ProjectionExpr { expr, alias })
.collect();
- Ok(Arc::new(ProjectionExec::try_new(proj_exprs, input_exec)?))
+ let mut proj_exec = ProjectionExec::try_new(proj_exprs,
input_exec)?;
Review Comment:
@kosiew, thanks for your patience, here's what changed since your last
review.
### Addressing the plan-shape dependency:
To address this I finally included a re-injection loop in the physical
planner that runs after each optimizer rule that modifies the plan (detected
via `Arc::ptr_eq`). This ensures optimizer-created nodes (e.g. `FilterExec`
from filter pushdown into unions, repositioned filters through joins, etc.)
always receive the registry. The walk is entirely skipped when the
ExpressionAnalyzer is disabled, no additional overhead for the common path.
Caveat: nodes created within a rule (e.g. a new FilterExec) receive the
registry only _after_ that rule completes, before the next rule runs. No such
rule (`EnforceDistribution`, `CombinePartialFinalAggregate`,
`ProjectionPushdown`, `AggregateStatistics` and `JoinSelection`) reads stats
from the nodes it just created within the same rule execution, so there's no
gap at the moment.
Since there is no gap and that this concern will go away if we integrate
fully with `partitions_statistics` (see the "Future direction" section), which
is the natural evolution of this proposal, I consider the integration
satisfactory, what do you think?
### Integration with StatisticsRegistry (#21483):
The `AggregateStatisticsProvider` and `JoinStatisticsProvider` providers
consume `ExpressionAnalyzer` and this adds support for aggregations and joins.
When no registry is injected, they delegate to the existing code path, purely
additive as the rest.
### Future direction: removing the injection overhead (just FYI, no action
for this PR)
The re-injection loop for `ExpressionAnalyzer` and the separate
`StatisticsRegistry` tree walk both exist because `partition_statistics` has no
way to receive external context and we don't want breaking changes.
But we could piggyback on #20184 (`child_stats` in `partition_statistics`)
as it plans to update `partition_statistics`'s signature anyway.
By generalizing slightly with a `StatisticsContext` parameter that carries
both the `ExpressionAnalyzerRegistry` and `StatisticsRegistry` providers, we
would get headroom for future changes while limiting breaking changes.
This would collapse both mechanisms into the natural `partition_statistics`
recursion, eliminating the injection walks and the separate bottom-up traversal
entirely while keeping the config flags to turn either framework on or off.
The separate walks are acceptable now as a trade-off for a purely additive
and non-breaking change, but this would be the natural evolution of the
proposal.
### New commits in detail:
I had to rebase on #21483, this will make incremental reviewing harder for
you, so let me walk you through the changes since your last review (commits
9-12 of 12):
9. Inject ExpressionAnalyzerRegistry into exec nodes and re-inject via
optimizer loop - the core change addressing your feedback, includes integration
tests
10. test(filter): verify ExpressionAnalyzer inclusion-exclusion selectivity
for OR predicates - unit test for OR selectivity in FilterExec
11. refactor(expression_analyzer): delegate when no NDV statistics are
available - I realized we were trying to do too much when some columns
statistics are unset, now we delegate to built-in
12. test(expression_analyzer): add StatisticsTable and end-to-end SLT for OR
selectivity - end-to-end SLT tests verifying the registry survives TopK sort,
UNION ALL filter pushdown, and hash join filter pushdown (queries chosen as
they generate nodes that would need the `ExpressionAnalyzer`, exactly the gap
we had before)
Looking forward to your feedback!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]