asolimando opened a new issue, #21120: URL: https://github.com/apache/datafusion/issues/21120
### Is your feature request related to a problem or challenge? DataFusion currently loses expression-level statistics when computing plan metadata: - Projections: any expression that isn't a bare column or literal gets `NDV = Absent`, even for simple cases like `col + 1` or `UPPER(name)` where NDV is trivially derivable from the input - Filters: when interval analysis cannot handle a predicate (`check_support` returns false), selectivity falls back to a hardcoded 20% regardless of available column statistics - Custom UDFs: there is no way for users to provide statistics metadata for their functions, making all UDFs opaque to the optimizer Without expression-level statistics, the optimizer lacks the information it needs for join ordering, cardinality estimation, and cost-based decisions involving computed columns or UDFs. Projects embedding DataFusion currently have no extension point to provide this information for their own functions. Related: this was previously raised in #992 (closed as non-actionable at the time). ### Describe the solution you'd like A pluggable chain-of-responsibility framework for expression-level statistics, covering: 1. Selectivity (predicate filtering fraction) 2. NDV (number of distinct values) 3. Min/max bounds 4. Null fraction The framework should: - Ship with a default Selinger-style analyzer handling columns, literals, binary expressions (AND/OR/NOT/comparisons), and arithmetic - Include built-in analyzers for common function families (string, math, date_part/date_trunc) - Allow users to register custom analyzers via `SessionState` for UDF-specific or domain-specific estimation (e.g., histogram-based, geometry-aware) - Integrate into physical operators that need expression-level statistics (projections, filters, joins, aggregates, etc.) - Be non-breaking and purely additive ### Describe alternatives you've considered - Extending `PhysicalExpr::evaluate_statistics()` (#14699): this provides per-expression statistics but doesn't support chain delegation or user-registered overrides, and would require changes to the `PhysicalExpr` trait - Hardcoding heuristics in each operator (the status quo): does not scale as more expressions and operators need statistics, and provides no extension point for users - Distribution-based API (#14896, #14699): more powerful but significantly more complex to implement and adopt; ExpressionAnalyzer can serve as the foundation, with distribution-based estimation plugged in as a custom analyzer ### Planned work Framework - [ ] ExpressionAnalyzer trait, chain-of-responsibility registry, SessionState integration - [ ] Default analyzer with Selinger-style heuristics (columns, literals, binary expressions, NOT) Built-in analyzers for common functions - [ ] String functions (UPPER, LOWER, TRIM, SUBSTRING, REPLACE, ...) - [ ] Math functions (FLOOR, CEIL, ROUND, ABS, EXP, LN, ...) - [ ] Date/time functions (date_part, date_trunc) Operator integration - [ ] Projection: propagate statistics through projected expressions - [ ] Filter: use analyzer selectivity when interval analysis is not applicable - [ ] Joins: expression-aware cardinality estimation for join key expressions - [ ] Aggregates: NDV-based output row estimation for GROUP BY expressions ### Additional context - Related: #992 (similar request, closed as non-actionable), #8227 (statistics improvements epic), #14699 (expression statistics API), #14896 (expression statistics tracking) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
