Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

via GitHub Fri, 22 May 2026 00:58:25 -0700


2010YOUY01 commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3286774613



##########
datafusion/physical-expr/src/expression_analyzer/mod.rs:
##########
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Pluggable expression-level statistics analysis.
+//!
+//! This module provides an extensible mechanism for computing expression-level
+//! statistics metadata (selectivity, NDV, min/max bounds) following the chain
+//! of responsibility pattern.
+//!
+//! # Overview
+//!
+//! Different expressions have different statistical properties:
+//!
+//! - **Injective functions** (UPPER, LOWER, ABS on non-negative): preserve NDV
+//! - **Non-injective functions** (FLOOR, YEAR, SUBSTRING): reduce NDV
+//! - **Monotonic functions**: allow min/max bound propagation
+//! - **Constants**: NDV = 1, selectivity depends on value
+//!
+//! The default implementation uses classic Selinger-style estimation. Users can
+//! register custom [`ExpressionAnalyzer`] implementations to:
+//!
+//! 1. Provide statistics for custom UDFs
+//! 2. Override default estimation with domain-specific knowledge
+//! 3. Plug in advanced approaches (e.g., histogram-based estimation)
+//!
+//! # Example
+//!
+//! ```ignore
+//! use datafusion_physical_expr::expression_analyzer::*;
+//!
+//! // Create registry with default analyzer
+//! let mut registry = ExpressionAnalyzerRegistry::new();
+//!
+//! // Register custom analyzer (higher priority)
+//! registry.register(Arc::new(MyCustomAnalyzer));
+//!
+//! // Query expression statistics
+//! let selectivity = registry.get_selectivity(&predicate, &input_stats);
+//! ```
+
+mod default;
+
+#[cfg(test)]
+mod tests;
+
+pub use default::DefaultExpressionAnalyzer;
+
+use std::fmt::Debug;
+use std::sync::Arc;
+
+use datafusion_common::{ScalarValue, Statistics};
+
+use crate::PhysicalExpr;
+
+/// Result of expression analysis - either computed or delegate to next analyzer.
+#[derive(Debug, Clone)]
+pub enum AnalysisResult<T> {
+    /// Analysis was performed, here's the result
+    Computed(T),
+    /// This analyzer doesn't handle this expression; delegate to next
+    Delegate,
+}
+
+/// Expression-level metadata analysis.
+///
+/// Implementations can handle specific expression types or provide domain
+/// knowledge for custom UDFs. The chain of analyzers is traversed until one
+/// returns [`AnalysisResult::Computed`].
+///
+/// The `registry` parameter allows analyzers to delegate sub-expression
+/// analysis back through the full chain, rather than hard-coding a specific
+/// analyzer. For example, a function analyzer can ask the registry for the
+/// NDV of its input argument, which will traverse the full chain (including
+/// any custom analyzers the user registered).
+///
+/// # Implementing a Custom Analyzer
+///
+/// ```ignore
+/// #[derive(Debug)]
+/// struct MyUdfAnalyzer;
+///
+/// impl ExpressionAnalyzer for MyUdfAnalyzer {
+///     fn get_selectivity(
+///         &self,
+///         expr: &Arc<dyn PhysicalExpr>,
+///         input_stats: &Statistics,
+///         registry: &ExpressionAnalyzerRegistry,
+///     ) -> AnalysisResult<f64> {
+///         // Recognize my custom is_valid_email() UDF
+///         if is_my_email_validator(expr) {
+///             return AnalysisResult::Computed(0.8); // ~80% valid
+///         }
+///         AnalysisResult::Delegate
+///     }
+/// }
+/// ```
+pub trait ExpressionAnalyzer: Debug + Send + Sync {

Review Comment:
   I think we could unify those APIs into a single one:
   
   ```
   pub trait ExpressionAnalyzer {
       fn analyze(...) -> AnalysisResult<ExpressionAnalysis>
   }
   
   pub enum ExpressionAnalysis {
       Predicate(PredicateStats),
       Value(ExprStats),
   }
   
   pub struct PredicateStats {
       pub selectivity: f64,
   }
   
   #[derive(Default, Clone)]
   pub struct ExprStats {
       pub distinct_count: Option<usize>,
       pub min_max: Option<(ScalarValue, ScalarValue)>,
       pub null_fraction: Option<f64>,
   }
   ```
   
   The reasons are:
   - Type safety: for example, `get_selectivity(a + 1)` should return an error, and the enum approach makes that easier to handle.
   - Potentially cleaner: if this registry grows large, we might want the stat propagation logic for a single function to stay in one place. I also suspect that some advanced stat types will need to consider several other existing stat types to infer their output, so grouping the stat types can also be useful 🤔
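
To make the type-safety point concrete, here is a minimal, dependency-free sketch of how a single `analyze` entry point could reject `get_selectivity(a + 1)`. Everything beyond the `ExpressionAnalysis`, `PredicateStats`, and `ExprStats` shapes proposed above is an assumption made for illustration: the `Expr` enum is a hypothetical stand-in for `Arc<dyn PhysicalExpr>`, `i64` stands in for `ScalarValue`, a plain slice of `ExprStats` stands in for `Statistics`, and the `0.1` fallback and the free functions `analyze` and `selectivity` are not part of the PR.

```rust
/// Hypothetical stand-in for `Arc<dyn PhysicalExpr>`, for illustration only.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct PredicateStats {
    pub selectivity: f64,
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct ExprStats {
    pub distinct_count: Option<usize>,
    /// `i64` stands in for `ScalarValue` to keep the sketch dependency-free.
    pub min_max: Option<(i64, i64)>,
    pub null_fraction: Option<f64>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ExpressionAnalysis {
    Predicate(PredicateStats),
    Value(ExprStats),
}

/// Assumed fallback selectivity for equality when no NDV is available.
const DEFAULT_EQ_SELECTIVITY: f64 = 0.1;

/// Single entry point: the returned variant says what kind of expression it was.
/// `columns[i]` holds the (assumed) input statistics for column `i`.
pub fn analyze(expr: &Expr, columns: &[ExprStats]) -> ExpressionAnalysis {
    use ExpressionAnalysis::{Predicate, Value};
    match expr {
        // Unknown columns fall back to "no statistics".
        Expr::Column(i) => Value(columns.get(*i).cloned().unwrap_or_default()),
        Expr::Literal(v) => Value(ExprStats {
            distinct_count: Some(1),
            min_max: Some((*v, *v)),
            null_fraction: Some(0.0),
        }),
        // `x + literal` is injective and monotonic: keep NDV and null fraction,
        // shift the bounds (dropping them on overflow).
        Expr::Add(lhs, rhs) => match (analyze(lhs, columns), rhs.as_ref()) {
            (Value(s), Expr::Literal(c)) => Value(ExprStats {
                min_max: s.min_max.and_then(|(lo, hi)| {
                    Some((lo.checked_add(*c)?, hi.checked_add(*c)?))
                }),
                ..s
            }),
            _ => Value(ExprStats::default()),
        },
        // `value = literal`: estimate 1 / NDV when NDV is known.
        Expr::Eq(lhs, rhs) => {
            let selectivity = match (analyze(lhs, columns), rhs.as_ref()) {
                (Value(ExprStats { distinct_count: Some(n), .. }), Expr::Literal(_))
                    if n > 0 =>
                {
                    1.0 / n as f64
                }
                _ => DEFAULT_EQ_SELECTIVITY,
            };
            Predicate(PredicateStats { selectivity })
        }
    }
}

/// Asking for the selectivity of a non-predicate becomes an explicit error.
pub fn selectivity(expr: &Expr, columns: &[ExprStats]) -> Result<f64, String> {
    match analyze(expr, columns) {
        ExpressionAnalysis::Predicate(p) => Ok(p.selectivity),
        ExpressionAnalysis::Value(_) => {
            Err(format!("expected a predicate, got a value expression: {expr:?}"))
        }
    }
}
```

With these stand-ins, `selectivity` on `a + 1` returns an `Err`, while `a + 1 = 5` over a column with 4 distinct values returns `Ok(0.25)`. Note also that the `Add` arm updates NDV, bounds, and null fraction together in one place, which is the grouping benefit mentioned in the second bullet.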



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


