kosiew commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3092441214


##########
datafusion/core/src/physical_planner.rs:
##########
@@ -2769,15 +2771,30 @@ impl DefaultPhysicalPlanner {
         // to verify that the plan fulfills the base requirements.
         InvariantChecker(InvariantLevel::Always).check(&plan)?;
 
+        let use_expression_analyzer = session_state
+            .config_options()
+            .optimizer
+            .use_expression_analyzer;
         let mut new_plan = Arc::clone(&plan);
         for optimizer in optimizers {
             let before_schema = new_plan.schema();
+            let plan_before_rule = Arc::clone(&new_plan);
             new_plan = optimizer
                 .optimize_with_context(new_plan, session_state)
                 .map_err(|e| {
                     DataFusionError::Context(optimizer.name().to_string(), 
Box::new(e))
                 })?;
 
+            // Re-inject ExpressionAnalyzer registry into any exec nodes 
created or replaced by
+            // this rule. Skip if the rule returned the same plan unchanged to
+            // avoid an O(nodes) walk for no-op rules.
+            if use_expression_analyzer && !Arc::ptr_eq(&plan_before_rule, 
&new_plan) {

Review Comment:
   I looked at the registry injection behavior and I agree the situation is 
better than it first appeared. Since `inject_expression_analyzer` walks the 
whole tree via `transform_up` whenever any optimizer rule changes the plan, 
nodes created earlier do end up getting the registry in most cases.
   
   That said, there is still a narrow gap. If every physical optimizer rule is 
a no-op, the injection pass never runs, and planner-built nodes remain without 
a registry. This is probably rare in practice, but it does mean the behavior 
still depends slightly on whether any rule fires.
   
   It would be good to either close this gap entirely or document that this 
edge case exists so the behavior is easier to reason about.



##########
datafusion/physical-plan/src/operator_statistics/mod.rs:
##########
@@ -782,9 +779,16 @@ impl StatisticsProvider for JoinStatisticsProvider {
             return Ok(StatisticsResult::Delegate);
         };
 
-        use crate::joins::JoinOnRef;
+        let default_analyzer;
+        let analyzer = match plan.expression_analyzer_registry() {

Review Comment:
   I think there is a small mismatch between the intended behavior and what is 
implemented for joins. The aggregate path does delegate when no registry is 
present, which matches the description. However, `JoinStatisticsProvider` 
creates a fresh `ExpressionAnalyzerRegistry::new()` when none is found.
   
   That effectively enables expression-based NDV estimation for joins even when 
the planner did not inject a registry, and even if `use_expression_analyzer` is 
false as long as `use_statistics_registry` is true. So this is not purely 
additive and it bypasses the usual config gate and the expectation that 
analyzers come from exec nodes.
   
   It would be good to align this with the aggregate behavior or clarify the 
intended contract, since right now joins behave differently in a way that may 
surprise users.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to