[I] Pluggable expression-level statistics estimation (ExpressionAnalyzer) [datafusion]

via GitHub Mon, 23 Mar 2026 10:07:19 -0700


asolimando opened a new issue, #21120:
URL: https://github.com/apache/datafusion/issues/21120


   ### Is your feature request related to a problem or challenge?
   
   DataFusion currently loses expression-level statistics when computing plan 
metadata:
   
   - Projections: any expression that isn't a bare column or literal gets `NDV 
= Absent`, even for simple cases like `col + 1` or `UPPER(name)` where NDV can 
be derived from the input (preserved by `col + 1`, bounded above by the input 
NDV for `UPPER(name)`)
   - Filters: when interval analysis cannot handle a predicate (`check_support` 
returns false), selectivity falls back to a hardcoded 20% regardless of 
available column statistics
   - Custom UDFs: there is no way for users to provide statistics metadata for 
their functions, making all UDFs opaque to the optimizer
   
   Without expression-level statistics, the optimizer lacks the information it 
needs for join ordering, cardinality estimation, and cost-based decisions 
involving computed columns or UDFs. Projects embedding DataFusion currently 
have no extension point to provide this information for their own functions.
   
   Related: this was previously raised in #992 (closed as non-actionable at the 
time).
   
   ### Describe the solution you'd like
   
   A pluggable chain-of-responsibility framework for expression-level 
statistics, covering:
   
   1. Selectivity (predicate filtering fraction)
   2. NDV (number of distinct values)
   3. Min/max bounds
   4. Null fraction
   
   The framework should:
   
   - Ship with a default Selinger-style analyzer handling columns, literals, 
binary expressions (AND/OR/comparisons), NOT, and arithmetic
   - Include built-in analyzers for common function families (string, math, 
date_part/date_trunc)
   - Allow users to register custom analyzers via `SessionState` for 
UDF-specific or domain-specific estimation (e.g., histogram-based, 
geometry-aware)
   - Integrate into physical operators that need expression-level statistics 
(projections, filters, joins, aggregates, etc.)
   - Be non-breaking and purely additive
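
The chain-of-responsibility idea in the list above can be sketched as follows. This is only an illustration under stated assumptions: the `Expr` enum, the `ColumnNdv` alias, the `ExpressionAnalyzer` trait signature, `AnalyzerChain`, `DefaultAnalyzer`, and `StringFnAnalyzer` are all hypothetical names invented for this sketch, not DataFusion APIs, and only NDV is modelled. Each analyzer returns `None` to delegate to the next one in the chain.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression tree (not DataFusion's `PhysicalExpr`).
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(i64),
    /// Addition of two sub-expressions, e.g. `col + 1`.
    Add(Box<Expr>, Box<Expr>),
    /// Scalar function call, e.g. `upper(name)`.
    Call(String, Vec<Expr>),
}

/// Hypothetical input statistics: column name -> number of distinct values.
pub type ColumnNdv = HashMap<String, u64>;

/// Hypothetical analyzer trait. Returning `None` means "not handled here,
/// ask the next analyzer in the chain".
pub trait ExpressionAnalyzer {
    fn ndv(&self, expr: &Expr, input: &ColumnNdv, chain: &AnalyzerChain) -> Option<u64>;
}

/// Ordered list of analyzers; earlier entries take precedence.
pub struct AnalyzerChain {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl AnalyzerChain {
    pub fn new(analyzers: Vec<Box<dyn ExpressionAnalyzer>>) -> Self {
        Self { analyzers }
    }

    /// The first analyzer that returns `Some` wins.
    pub fn ndv(&self, expr: &Expr, input: &ColumnNdv) -> Option<u64> {
        self.analyzers.iter().find_map(|a| a.ndv(expr, input, self))
    }
}

/// Fallback analyzer: columns, literals, and `expr + literal`.
pub struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn ndv(&self, expr: &Expr, input: &ColumnNdv, chain: &AnalyzerChain) -> Option<u64> {
        match expr {
            Expr::Column(name) => input.get(name).copied(),
            Expr::Literal(_) => Some(1),
            // Adding a constant is injective, so the NDV carries over.
            Expr::Add(l, r) => match (l.as_ref(), r.as_ref()) {
                (e, Expr::Literal(_)) | (Expr::Literal(_), e) => chain.ndv(e, input),
                _ => None,
            },
            Expr::Call(..) => None,
        }
    }
}

/// Example of a user-registered analyzer for case-folding string functions.
pub struct StringFnAnalyzer;

impl ExpressionAnalyzer for StringFnAnalyzer {
    fn ndv(&self, expr: &Expr, input: &ColumnNdv, chain: &AnalyzerChain) -> Option<u64> {
        match expr {
            Expr::Call(name, args)
                if matches!(name.as_str(), "upper" | "lower") && args.len() == 1 =>
            {
                // Case folding may merge values, so this is an upper bound.
                chain.ndv(&args[0], input)
            }
            _ => None,
        }
    }
}
```

In this sketch a chain built as `[StringFnAnalyzer, DefaultAnalyzer]` resolves `upper(name)` through the string analyzer and `a + 1` through the default one, while an unknown function falls through every analyzer and yields `None`, which a caller could map to an absent statistic.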
   
   ### Describe alternatives you've considered
   
   - Extending `PhysicalExpr::evaluate_statistics()` (#14699): this provides 
per-expression statistics but doesn't support chain delegation or 
user-registered overrides, and would require changes to the `PhysicalExpr` trait
   - Hardcoding heuristics in each operator (the status quo): does not scale as 
more expressions and operators need statistics, and provides no extension point 
for users
   - Distribution-based API (#14896, #14699): more powerful but significantly 
more complex to implement and adopt; ExpressionAnalyzer can serve as the 
foundation, with distribution-based estimation plugged in as a custom analyzer
   
   ### Planned work
   
   Framework
   - [ ] ExpressionAnalyzer trait, chain-of-responsibility registry, 
SessionState integration
   - [ ] Default analyzer with Selinger-style heuristics (columns, literals, 
binary expressions, NOT)
   
   Built-in analyzers for common functions
   - [ ] String functions (UPPER, LOWER, TRIM, SUBSTRING, REPLACE, ...)
   - [ ] Math functions (FLOOR, CEIL, ROUND, ABS, EXP, LN, ...)
   - [ ] Date/time functions (date_part, date_trunc)
   
   Operator integration
   - [ ] Projection: propagate statistics through projected expressions
   - [ ] Filter: use analyzer selectivity when interval analysis is not 
applicable
   - [ ] Joins: expression-aware cardinality estimation for join key expressions
   - [ ] Aggregates: NDV-based output row estimation for GROUP BY expressions
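
To make the "Selinger-style heuristics" and "NDV-based output row estimation" items above more concrete, here is a minimal sketch of the classic rules: `1/NDV` for equality, the product for AND, inclusion-exclusion for OR, the complement for NOT, and a GROUP BY estimate capped by the input row count. The `Predicate` enum, the function names, and the default constants (1/10 for equality and 1/3 for ranges, the textbook System R defaults) are assumptions made for this illustration; the issue does not specify which defaults the proposed analyzer would use.

```rust
/// Hypothetical predicate tree; not DataFusion's expression types.
pub enum Predicate {
    /// `column = literal`, carrying the column's NDV when it is known.
    Eq { ndv: Option<u64> },
    /// Range comparison such as `column < literal` with no usable bounds.
    Range,
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

/// Assumed defaults for this sketch (textbook System R values).
const DEFAULT_EQ: f64 = 0.1;
const DEFAULT_RANGE: f64 = 1.0 / 3.0;

/// Estimated fraction of rows that satisfy `p`, assuming independent
/// predicates and uniformly distributed values.
pub fn selectivity(p: &Predicate) -> f64 {
    match p {
        Predicate::Eq { ndv: Some(n) } if *n > 0 => 1.0 / *n as f64,
        Predicate::Eq { .. } => DEFAULT_EQ,
        Predicate::Range => DEFAULT_RANGE,
        Predicate::And(l, r) => selectivity(l) * selectivity(r),
        Predicate::Or(l, r) => {
            let (a, b) = (selectivity(l), selectivity(r));
            a + b - a * b
        }
        Predicate::Not(inner) => 1.0 - selectivity(inner),
    }
}

/// GROUP BY output rows: product of the key NDVs, capped by the input rows.
pub fn group_by_rows(input_rows: u64, key_ndvs: &[u64]) -> u64 {
    key_ndvs
        .iter()
        .fold(1u64, |acc, n| acc.saturating_mul(*n))
        .min(input_rows)
}
```

Under these assumptions, an equality predicate on a column with 50 distinct values gets selectivity 1/50 rather than a fixed fallback, and grouping 1000 rows by two keys with NDVs 10 and 20 is estimated at 200 output rows.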
   
   ### Additional context
   
   - Related: #992 (similar request, closed as non-actionable), #8227 
(statistics improvements epic), #14699 (expression statistics API), #14896 
(expression statistics tracking)


