kosiew commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3092441214
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -2769,15 +2771,30 @@ impl DefaultPhysicalPlanner {
// to verify that the plan fulfills the base requirements.
InvariantChecker(InvariantLevel::Always).check(&plan)?;
+ let use_expression_analyzer = session_state
+ .config_options()
+ .optimizer
+ .use_expression_analyzer;
let mut new_plan = Arc::clone(&plan);
for optimizer in optimizers {
let before_schema = new_plan.schema();
+ let plan_before_rule = Arc::clone(&new_plan);
new_plan = optimizer
.optimize_with_context(new_plan, session_state)
.map_err(|e| {
DataFusionError::Context(optimizer.name().to_string(),
Box::new(e))
})?;
+ // Re-inject ExpressionAnalyzer registry into any exec nodes
created or replaced by
+ // this rule. Skip if the rule returned the same plan unchanged to
+ // avoid an O(nodes) walk for no-op rules.
+ if use_expression_analyzer && !Arc::ptr_eq(&plan_before_rule,
&new_plan) {
Review Comment:
I looked at the registry injection behavior and I agree the situation is
better than it first appeared. Since `inject_expression_analyzer` walks the
whole tree via `transform_up` whenever any optimizer rule changes the plan,
nodes created earlier do end up getting the registry in most cases.
That said, there is still a narrow gap. If every physical optimizer rule is
a no-op, the injection pass never runs, and planner-built nodes remain without
a registry. This is probably rare in practice, but it does mean the behavior
still depends slightly on whether any rule fires.
It would be good to either close this gap entirely or document that this
edge case exists so the behavior is easier to reason about.
##########
datafusion/physical-plan/src/operator_statistics/mod.rs:
##########
@@ -782,9 +779,16 @@ impl StatisticsProvider for JoinStatisticsProvider {
return Ok(StatisticsResult::Delegate);
};
- use crate::joins::JoinOnRef;
+ let default_analyzer;
+ let analyzer = match plan.expression_analyzer_registry() {
Review Comment:
I think there is a small mismatch between the intended behavior and what is
implemented for joins. The aggregate path does delegate when no registry is
present, which matches the description. However, `JoinStatisticsProvider`
creates a fresh `ExpressionAnalyzerRegistry::new()` when none is found.
That effectively enables expression-based NDV estimation for joins even when
the planner did not inject a registry, and even if `use_expression_analyzer` is
false as long as `use_statistics_registry` is true. So this is not purely
additive and it bypasses the usual config gate and the expectation that
analyzers come from exec nodes.
It would be good to align this with the aggregate behavior or clarify the
intended contract, since right now joins behave differently in a way that may
surprise users.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]