Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

via GitHub Wed, 01 Apr 2026 21:10:25 -0700


kosiew commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3025779809



##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -713,9 +745,35 @@ impl ProjectionExprs {
                         byte_size,
                     }
                 }
+            } else if let Some(registry) = &self.expression_analyzer_registry {

Review Comment:
   I think there is a subtle ordering issue here. `project_statistics` asks the 
registry for derived expression stats after earlier bare-column projections 
have already taken their source column stats out of `stats.column_statistics`.
   
   That makes estimation order dependent. For example, with `[a, a + 1]`, we 
preserve stats for `a`, but then lose them for `a + 1` because the analyzer now 
sees column 0 as unknown.
   
   Would it make sense to keep an immutable copy of the input stats for 
analyzer lookups, or defer the `take`, so later expressions can still see the 
original column metadata?



##########
datafusion/physical-expr/src/expression_analyzer/default.rs:
##########
@@ -0,0 +1,285 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Default expression analyzer with Selinger-style estimation.
+
+use std::sync::Arc;
+
+use datafusion_common::{ColumnStatistics, ScalarValue, Statistics};
+use datafusion_expr::Operator;
+
+use crate::PhysicalExpr;
+use crate::expressions::{BinaryExpr, Column, Literal, NotExpr};
+
+use super::{AnalysisResult, ExpressionAnalyzer, ExpressionAnalyzerRegistry};
+
+/// Default expression analyzer with Selinger-style estimation.
+///
+/// Handles common expression types:
+/// - Column references (uses column statistics)
+/// - Binary expressions (AND, OR, comparison operators)
+/// - Literals (constant selectivity/NDV)
+/// - NOT expressions (1 - child selectivity)
+#[derive(Debug, Default, Clone)]
+pub struct DefaultExpressionAnalyzer;
+
+impl DefaultExpressionAnalyzer {
+    /// Get column index from a Column expression
+    fn get_column_index(expr: &Arc<dyn PhysicalExpr>) -> Option<usize> {
+        expr.as_any().downcast_ref::<Column>().map(|c| c.index())
+    }
+
+    /// Get column statistics for an expression if it's a column reference
+    fn get_column_stats<'a>(
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &'a Statistics,
+    ) -> Option<&'a ColumnStatistics> {
+        Self::get_column_index(expr)
+            .and_then(|idx| input_stats.column_statistics.get(idx))
+    }
+
+    /// Recursive selectivity estimation through the registry chain
+    fn estimate_selectivity_recursive(
+        &self,
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &Statistics,
+        registry: &ExpressionAnalyzerRegistry,
+    ) -> f64 {
+        registry.get_selectivity(expr, input_stats).unwrap_or(0.5)
+    }
+}
+
+impl ExpressionAnalyzer for DefaultExpressionAnalyzer {
+    fn get_selectivity(
+        &self,
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &Statistics,
+        registry: &ExpressionAnalyzerRegistry,
+    ) -> AnalysisResult<f64> {
+        // Binary expressions: AND, OR, comparisons
+        if let Some(binary) = expr.as_any().downcast_ref::<BinaryExpr>() {
+            let sel = match binary.op() {
+                // Logical operators: need child selectivities
+                Operator::And => {
+                    let left_sel = self.estimate_selectivity_recursive(
+                        binary.left(),
+                        input_stats,
+                        registry,
+                    );
+                    let right_sel = self.estimate_selectivity_recursive(
+                        binary.right(),
+                        input_stats,
+                        registry,
+                    );
+                    left_sel * right_sel
+                }
+                Operator::Or => {
+                    let left_sel = self.estimate_selectivity_recursive(
+                        binary.left(),
+                        input_stats,
+                        registry,
+                    );
+                    let right_sel = self.estimate_selectivity_recursive(
+                        binary.right(),
+                        input_stats,
+                        registry,
+                    );
+                    left_sel + right_sel - (left_sel * right_sel)
+                }
+
+                // Equality: selectivity = 1/NDV
+                Operator::Eq => {
+                    let ndv = Self::get_column_stats(binary.left(), 
input_stats)
+                        .or_else(|| Self::get_column_stats(binary.right(), 
input_stats))
+                        .and_then(|s| s.distinct_count.get_value())
+                        .filter(|&&ndv| ndv > 0);
+                    if let Some(ndv) = ndv {
+                        return AnalysisResult::Computed(1.0 / (*ndv as f64));
+                    }
+                    0.1 // Default equality selectivity
+                }
+
+                // Inequality: selectivity = 1 - 1/NDV
+                Operator::NotEq => {
+                    let ndv = Self::get_column_stats(binary.left(), 
input_stats)
+                        .or_else(|| Self::get_column_stats(binary.right(), 
input_stats))
+                        .and_then(|s| s.distinct_count.get_value())
+                        .filter(|&&ndv| ndv > 0);
+                    if let Some(ndv) = ndv {
+                        return AnalysisResult::Computed(1.0 - (1.0 / (*ndv as 
f64)));
+                    }
+                    0.9
+                }
+
+                // Range predicates: classic 1/3 estimate
+                Operator::Lt | Operator::LtEq | Operator::Gt | Operator::GtEq 
=> 0.33,
+
+                // LIKE: depends on pattern, use conservative estimate
+                Operator::LikeMatch | Operator::ILikeMatch => 0.25,
+                Operator::NotLikeMatch | Operator::NotILikeMatch => 0.75,
+
+                // Other operators: default
+                _ => 0.5,
+            };
+
+            return AnalysisResult::Computed(sel);
+        }
+
+        // NOT expression: 1 - child selectivity
+        if let Some(not_expr) = expr.as_any().downcast_ref::<NotExpr>() {
+            let child_sel = self.estimate_selectivity_recursive(
+                not_expr.arg(),
+                input_stats,
+                registry,
+            );
+            return AnalysisResult::Computed(1.0 - child_sel);
+        }
+
+        // Literal boolean: 0.0 or 1.0
+        if let Some(b) = expr
+            .as_any()
+            .downcast_ref::<Literal>()
+            .and_then(|lit| match lit.value() {
+                ScalarValue::Boolean(Some(b)) => Some(*b),
+                _ => None,
+            })
+        {
+            return AnalysisResult::Computed(if b { 1.0 } else { 0.0 });
+        }
+
+        // Column reference as predicate (boolean column)
+        if expr.as_any().downcast_ref::<Column>().is_some() {
+            return AnalysisResult::Computed(0.5);
+        }
+
+        AnalysisResult::Delegate
+    }
+
+    fn get_distinct_count(
+        &self,
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &Statistics,
+        registry: &ExpressionAnalyzerRegistry,
+    ) -> AnalysisResult<usize> {
+        // Column reference: use column NDV
+        if let Some(ndv) = Self::get_column_stats(expr, input_stats)
+            .and_then(|col_stats| 
col_stats.distinct_count.get_value().copied())
+        {
+            return AnalysisResult::Computed(ndv);
+        }
+
+        // Literal: NDV = 1
+        if expr.as_any().downcast_ref::<Literal>().is_some() {
+            return AnalysisResult::Computed(1);
+        }
+
+        // BinaryExpr: for arithmetic with a literal operand, treat as 
injective

Review Comment:
   This looks a bit broader than the comment suggests. Treating every 
arithmetic-with-literal expression as injective might be risky.
   
   Cases like `col * 0`, integer `col / 2`, or especially `col % 10` can reduce 
NDV quite a lot, so returning the input NDV could overshoot.
   
   Maybe we could narrow this to clearly safe cases first, like `+` or `-` with 
literals, and possibly `*` by non-zero constants for floating types. 
Alternatively, adding a few regression tests to document the intended 
approximation range could also help.



##########
datafusion/physical-expr/src/expression_analyzer/default.rs:
##########
@@ -0,0 +1,285 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Default expression analyzer with Selinger-style estimation.
+
+use std::sync::Arc;
+
+use datafusion_common::{ColumnStatistics, ScalarValue, Statistics};
+use datafusion_expr::Operator;
+
+use crate::PhysicalExpr;
+use crate::expressions::{BinaryExpr, Column, Literal, NotExpr};
+
+use super::{AnalysisResult, ExpressionAnalyzer, ExpressionAnalyzerRegistry};
+
+/// Default expression analyzer with Selinger-style estimation.
+///
+/// Handles common expression types:
+/// - Column references (uses column statistics)
+/// - Binary expressions (AND, OR, comparison operators)
+/// - Literals (constant selectivity/NDV)
+/// - NOT expressions (1 - child selectivity)
+#[derive(Debug, Default, Clone)]
+pub struct DefaultExpressionAnalyzer;
+
+impl DefaultExpressionAnalyzer {
+    /// Get column index from a Column expression
+    fn get_column_index(expr: &Arc<dyn PhysicalExpr>) -> Option<usize> {
+        expr.as_any().downcast_ref::<Column>().map(|c| c.index())
+    }
+
+    /// Get column statistics for an expression if it's a column reference
+    fn get_column_stats<'a>(
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &'a Statistics,
+    ) -> Option<&'a ColumnStatistics> {
+        Self::get_column_index(expr)
+            .and_then(|idx| input_stats.column_statistics.get(idx))
+    }
+
+    /// Recursive selectivity estimation through the registry chain
+    fn estimate_selectivity_recursive(
+        &self,
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &Statistics,
+        registry: &ExpressionAnalyzerRegistry,
+    ) -> f64 {
+        registry.get_selectivity(expr, input_stats).unwrap_or(0.5)
+    }
+}
+
+impl ExpressionAnalyzer for DefaultExpressionAnalyzer {
+    fn get_selectivity(
+        &self,
+        expr: &Arc<dyn PhysicalExpr>,
+        input_stats: &Statistics,
+        registry: &ExpressionAnalyzerRegistry,
+    ) -> AnalysisResult<f64> {
+        // Binary expressions: AND, OR, comparisons
+        if let Some(binary) = expr.as_any().downcast_ref::<BinaryExpr>() {
+            let sel = match binary.op() {
+                // Logical operators: need child selectivities
+                Operator::And => {
+                    let left_sel = self.estimate_selectivity_recursive(
+                        binary.left(),
+                        input_stats,
+                        registry,
+                    );
+                    let right_sel = self.estimate_selectivity_recursive(
+                        binary.right(),
+                        input_stats,
+                        registry,
+                    );
+                    left_sel * right_sel
+                }
+                Operator::Or => {
+                    let left_sel = self.estimate_selectivity_recursive(
+                        binary.left(),
+                        input_stats,
+                        registry,
+                    );
+                    let right_sel = self.estimate_selectivity_recursive(
+                        binary.right(),
+                        input_stats,
+                        registry,
+                    );
+                    left_sel + right_sel - (left_sel * right_sel)
+                }
+
+                // Equality: selectivity = 1/NDV
+                Operator::Eq => {
+                    let ndv = Self::get_column_stats(binary.left(), 
input_stats)

Review Comment:
   Right now the equality selectivity path only looks at direct column 
statistics from the left and right children. That means predicates whose NDV is 
available through another analyzer still fall back to `0.1`.
   
   For example, `(a + 1) = 42` or `my_udf(a) = 42` would not benefit from the 
new pluggable NDV estimation, even though `get_distinct_count` could derive it.
   
   Would it be reasonable to call `registry.get_distinct_count(...)` on the 
non-literal side here? That would let the new framework actually improve 
selectivity for transformed expressions and UDFs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

Reply via email to