asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2728830062
##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
}
}
} else {
- // TODO stats: estimate more statistics from expressions
- // (expressions should compute their statistics themselves)
- ColumnStatistics::new_unknown()
+ // TODO: expressions should compute their own statistics
Review Comment:
Good question! I have an idea of how this could evolve, based on my
experience with Apache Calcite.
The idea is to make statistics propagation pluggable, with each relational
operator having a default but configurable logic for how statistics propagate
through it.
The default implementation would follow classic Selinger-style estimation
(selectivity factors, independence assumptions), as seen in this PR. A nice
intro to this can be found
[here](https://15799.courses.cs.cmu.edu/spring2025/slides/13-cardinalities1.pdf).
That's what most OSS databases implement by default.
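
To make the "selectivity factors, independence assumptions" part concrete, here is a minimal sketch of Selinger-style arithmetic. The function names are hypothetical and are not taken from this PR or from DataFusion; the `1/10` fallback is the classic System R default for an equality predicate without statistics.

```rust
/// Equality predicate on a column with `ndv` distinct values: 1 / ndv.
/// Hypothetical helper, for illustration only.
fn eq_selectivity(ndv: Option<u64>) -> f64 {
    match ndv {
        Some(n) if n > 0 => 1.0 / n as f64,
        // No usable statistics: fall back to the System R default of 1/10.
        _ => 0.1,
    }
}

/// Independence assumption: sel(a AND b) = sel(a) * sel(b).
fn and_selectivity(a: f64, b: f64) -> f64 {
    a * b
}

/// Independence assumption: sel(a OR b) = sel(a) + sel(b) - sel(a) * sel(b).
fn or_selectivity(a: f64, b: f64) -> f64 {
    a + b - a * b
}

/// Scale the input row count by a selectivity clamped to [0, 1].
fn estimate_rows(input_rows: u64, selectivity: f64) -> u64 {
    (input_rows as f64 * selectivity.clamp(0.0, 1.0)).round() as u64
}

fn main() {
    // WHERE a = ? AND b = ?, with NDV(a) = 50 and NDV(b) = 4.
    let sel = and_selectivity(eq_selectivity(Some(50)), eq_selectivity(Some(4)));
    println!("{}", estimate_rows(1_000_000, sel)); // prints 5000
}
```

The independence assumption is what lets the two selectivities be multiplied; correlated columns are exactly where this default tends to underestimate, which is one motivation for letting users plug in something better.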
Following DataFusion's philosophy as a customizable framework, users should
be able to override and complement this logic when needed.
Proposed architecture:
- `StatisticsProvider`: chain element that computes statistics for specific
operators (returns `Computed` or `Delegate`)
- `StatisticsRegistry`: chains providers, would live in `SessionState`
- `CardinalityEstimator`: unified interface for metadata queries (row count,
selectivity, NDV, ...) - similar to Calcite's
[RelMetadataProvider](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataProvider.html)/[RelMetadataQuery](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataQuery.html)
- `ExtendedStatistics`: `Statistics` with type-safe custom extensions for
histograms, sketches, etc. (I am looking at type-erased maps for that but I am
not sure that's the best way to implement it)
- `ExpressionAnalyzerRegistry`+`ExpressionAnalyzer`: the same chain concept
applied to expression analyzers, equivalent to what is described above for
operators, so that both built-in expressions and UDFs can be covered
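
As an illustration of the type-erased map mentioned for `ExtendedStatistics`, here is one possible shape, keyed by `TypeId`. Everything here is a hypothetical stand-in (including the `Extensions` and `Histogram` types and the `num_rows` field); it is a sketch of the technique, not a proposed API.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Hypothetical container holding at most one value per Rust type.
#[derive(Default)]
struct Extensions {
    map: HashMap<TypeId, Box<dyn Any + Send + Sync>>,
}

impl Extensions {
    /// Store `value`, replacing any previous value of the same type.
    fn insert<T: Any + Send + Sync>(&mut self, value: T) {
        self.map.insert(TypeId::of::<T>(), Box::new(value));
    }

    /// Look up by type; the downcast recovers the concrete type safely.
    fn get<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.map
            .get(&TypeId::of::<T>())
            .and_then(|boxed| boxed.downcast_ref::<T>())
    }
}

/// Hypothetical extension payload: per-bucket row counts for one column.
#[derive(Debug, PartialEq)]
struct Histogram {
    bucket_counts: Vec<u64>,
}

/// Simplified stand-in for `Statistics` plus custom extensions.
#[derive(Default)]
struct ExtendedStatistics {
    num_rows: Option<usize>,
    extensions: Extensions,
}

fn main() {
    let mut stats = ExtendedStatistics {
        num_rows: Some(100),
        ..Default::default()
    };
    stats.extensions.insert(Histogram { bucket_counts: vec![10, 60, 30] });

    // A consumer that knows about `Histogram` can retrieve it by type.
    let hist = stats.extensions.get::<Histogram>().expect("histogram present");
    assert_eq!(hist.bucket_counts.iter().sum::<u64>() as usize, stats.num_rows.unwrap());

    // A type that was never registered is simply absent.
    assert!(stats.extensions.get::<String>().is_none());
}
```

The trade-off of this approach is that one type maps to one slot, so per-column extensions would need a wrapper type or a richer key.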
This follows the same chain-of-responsibility pattern that the blog post at
https://datafusion.apache.org/blog/2026/01/12/extending-sql/ describes for custom
syntax/relations. Built-in operators get default handling, custom
`ExecutionPlan` nodes can plug in their own logic, and unknown cases delegate
down the chain. To override the default estimation (e.g., with histogram-based
approaches), you register your provider before the default one.
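
The provider chain described above could look roughly like the following. This is a simplified sketch under stated assumptions: `Stats`, `PlanNode`, `StatisticsResult`, and both providers are hypothetical stand-ins rather than existing DataFusion types, and the selectivities are hard-coded for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Option<u64>,
}

/// Toy plan; `Custom` stands for a user-defined `ExecutionPlan` node.
enum PlanNode {
    Scan { rows: u64 },
    Filter { input: Box<PlanNode> },
    Custom,
}

enum StatisticsResult {
    Computed(Stats),
    Delegate,
}

trait StatisticsProvider {
    fn compute(&self, node: &PlanNode, registry: &StatisticsRegistry) -> StatisticsResult;
}

/// Providers are tried in registration order; the first `Computed` wins.
struct StatisticsRegistry {
    providers: Vec<Box<dyn StatisticsProvider>>,
}

impl StatisticsRegistry {
    fn new() -> Self {
        Self { providers: Vec::new() }
    }

    fn register(&mut self, provider: Box<dyn StatisticsProvider>) {
        self.providers.push(provider);
    }

    fn statistics(&self, node: &PlanNode) -> Stats {
        for provider in &self.providers {
            if let StatisticsResult::Computed(stats) = provider.compute(node, self) {
                return stats;
            }
        }
        // Every provider delegated: the statistics are unknown.
        Stats { num_rows: None }
    }
}

/// Default logic: scans report their row count, filters keep 1/10 of the input.
struct DefaultProvider;

impl StatisticsProvider for DefaultProvider {
    fn compute(&self, node: &PlanNode, registry: &StatisticsRegistry) -> StatisticsResult {
        match node {
            PlanNode::Scan { rows } => StatisticsResult::Computed(Stats { num_rows: Some(*rows) }),
            PlanNode::Filter { input } => {
                let input_rows = registry.statistics(input).num_rows;
                StatisticsResult::Computed(Stats { num_rows: input_rows.map(|n| n / 10) })
            }
            PlanNode::Custom => StatisticsResult::Delegate,
        }
    }
}

/// Override for filters only (e.g. backed by a histogram in a real system).
struct FixedSelectivityFilterProvider {
    selectivity_percent: u64,
}

impl StatisticsProvider for FixedSelectivityFilterProvider {
    fn compute(&self, node: &PlanNode, registry: &StatisticsRegistry) -> StatisticsResult {
        match node {
            PlanNode::Filter { input } => {
                let input_rows = registry.statistics(input).num_rows;
                let rows = input_rows.map(|n| n * self.selectivity_percent / 100);
                StatisticsResult::Computed(Stats { num_rows: rows })
            }
            _ => StatisticsResult::Delegate,
        }
    }
}

fn main() {
    let plan = PlanNode::Filter { input: Box::new(PlanNode::Scan { rows: 1000 }) };

    let mut registry = StatisticsRegistry::new();
    // Registered first, so it takes precedence over the default for filters.
    registry.register(Box::new(FixedSelectivityFilterProvider { selectivity_percent: 25 }));
    registry.register(Box::new(DefaultProvider));

    println!("{:?}", registry.statistics(&plan)); // prints Stats { num_rows: Some(250) }
}
```

Note that the override still asks the registry for its input's statistics, so the scan below the filter is handled by the default provider; only the filter step is replaced.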
StatisticsV2 and the Distribution-based statistics are advanced and
interesting, but I see them more as an extension via `ExtendedStatistics` than
as a replacement for the default implementation. Your concern about NDVs, for
instance, is correct: that approach deals with distributions, but as-is it
can't answer questions about the number of distinct values.
If this sounds interesting and aligns with community interest, I can provide
a more detailed design doc and an epic to break down the work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]