Re: [PR] feat: Extract NDV (distinct_count) statistics from Parquet metadata [datafusion]

via GitHub Mon, 26 Jan 2026 10:58:25 -0800


asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2728830062



##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
                     }
                 }
             } else {
-                // TODO stats: estimate more statistics from expressions
-                // (expressions should compute their statistics themselves)
-                ColumnStatistics::new_unknown()
+                // TODO: expressions should compute their own statistics

Review Comment:
   Good question! I have an idea of how this could evolve, based on my 
experience with Apache Calcite.
   
   The idea is to make statistics propagation pluggable, with each relational 
operator having a default but configurable logic for how statistics propagate 
through it.
   
   The default implementation would follow classic Selinger-style estimation 
(selectivity factors, independence assumptions), as seen in this PR. A nice 
intro to this can be found 
[here](https://15799.courses.cs.cmu.edu/spring2025/slides/13-cardinalities1.pdf).
 That's what most OSS databases implement by default.
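   
   To make the Selinger-style default concrete, here is a minimal sketch of 
the arithmetic involved. All names (`ColumnSummary`, `eq_selectivity`, ...) 
are hypothetical and not DataFusion API, and the 1/10 fallback for an unknown 
NDV is just an assumed default for illustration.

```rust
/// Hypothetical per-column summary; NOT DataFusion's `ColumnStatistics`.
#[derive(Debug, Clone, Copy)]
struct ColumnSummary {
    /// Number of distinct values (NDV), if known.
    distinct_count: Option<u64>,
}

/// Selectivity of `col = literal` under the uniformity assumption: 1 / NDV.
/// When the NDV is unknown (or zero), fall back to an assumed default of 0.1.
fn eq_selectivity(col: &ColumnSummary) -> f64 {
    match col.distinct_count {
        Some(ndv) if ndv > 0 => 1.0 / ndv as f64,
        _ => 0.1,
    }
}

/// Selectivity of `p1 AND p2` under the independence assumption.
fn and_selectivity(s1: f64, s2: f64) -> f64 {
    s1 * s2
}

/// Selectivity of `p1 OR p2` under the independence assumption.
fn or_selectivity(s1: f64, s2: f64) -> f64 {
    s1 + s2 - s1 * s2
}

/// Estimated output row count of a filter with the given selectivity.
fn estimate_rows(input_rows: u64, selectivity: f64) -> u64 {
    (input_rows as f64 * selectivity.clamp(0.0, 1.0)).round() as u64
}
```

   For example, `a = 1 AND b = 2` over 1,000,000 rows with NDV(a) = 100 and 
NDV(b) = 50 would be estimated via 
`estimate_rows(1_000_000, and_selectivity(1.0 / 100.0, 1.0 / 50.0))`.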
   
   Following DataFusion's philosophy as a customizable framework, users should 
be able to override and complement this logic when needed.
   
   Proposed architecture:
   - `StatisticsProvider`: chain element that computes statistics for specific 
operators (returns `Computed` or `Delegate`)
   - `StatisticsRegistry`: chains providers, would live in `SessionState`
   - `CardinalityEstimator`: unified interface for metadata queries (row count, 
selectivity, NDV, ...) - similar to Calcite's 
[RelMetadataProvider](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataProvider.html)/[RelMetadataQuery](https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMetadataQuery.html)
   - `ExtendedStatistics`: `Statistics` with type-safe custom extensions for 
histograms, sketches, etc. (I am looking at type-erased maps for that but I am 
not sure that's the best way to implement it)
   - `ExpressionAnalyzerRegistry` + `ExpressionAnalyzer`: the same chain 
concept applied to expressions, mirroring what is described above for 
operators, so that both built-in functions and UDFs can be covered
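   
   For the `ExtendedStatistics` item, the type-erased map I have in mind could 
look roughly like the sketch below. This is only one possible shape under my 
own assumptions: `BaseStats` is a hypothetical stand-in for `Statistics`, and 
`EquiWidthHistogram` is a made-up example extension.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Hypothetical stand-in for DataFusion's `Statistics`.
#[derive(Debug, Default)]
struct BaseStats {
    num_rows: Option<u64>,
}

/// Base statistics plus custom extensions keyed by their Rust type.
#[derive(Default)]
struct ExtendedStatistics {
    base: BaseStats,
    extensions: HashMap<TypeId, Box<dyn Any + Send + Sync>>,
}

impl ExtendedStatistics {
    /// Stores an extension, replacing any previous value of the same type.
    fn insert<T: Any + Send + Sync>(&mut self, ext: T) {
        self.extensions.insert(TypeId::of::<T>(), Box::new(ext));
    }

    /// Returns the extension of type `T`, if one was registered.
    fn get<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.extensions
            .get(&TypeId::of::<T>())
            .and_then(|boxed| boxed.downcast_ref::<T>())
    }
}

/// Made-up example extension: an equi-width histogram over one column.
#[derive(Debug, PartialEq)]
struct EquiWidthHistogram {
    min: f64,
    max: f64,
    bucket_counts: Vec<u64>,
}
```

   Callers stay type-safe at the API boundary (`stats.get::<EquiWidthHistogram>()` 
returns `Option<&EquiWidthHistogram>`), at the cost of one extension per type 
and a runtime lookup.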
   
   This follows the same chain-of-responsibility pattern that 
https://datafusion.apache.org/blog/2026/01/12/extending-sql/ describes for 
custom syntax/relations. Built-in operators get default handling, custom 
`ExecutionPlan` nodes can plug in their own logic, and unknown cases delegate 
down the chain. To override the default estimation (e.g., with histogram-based 
approaches), you register your provider before the default one.
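   
   A minimal sketch of that chain, to show the `Computed`/`Delegate` flow. 
Everything here is hypothetical: `PlanNode` stands in for `ExecutionPlan`, 
only row counts are estimated, and the 1/10 filter selectivity is an assumed 
placeholder rather than what this PR does.

```rust
/// Hypothetical stand-in for a physical plan node.
#[derive(Debug)]
enum PlanNode {
    Scan { rows: u64 },
    Filter { input: Box<PlanNode> },
    Custom { name: String, input: Box<PlanNode> },
}

/// What a single provider answers for a node.
enum StatisticsResult {
    /// This provider handled the node and produced a row-count estimate.
    Computed(u64),
    /// This provider does not handle the node; try the next one.
    Delegate,
}

trait StatisticsProvider {
    fn estimate_rows(&self, node: &PlanNode, registry: &StatisticsRegistry) -> StatisticsResult;
}

/// Ordered chain of providers; earlier registrations take precedence.
struct StatisticsRegistry {
    providers: Vec<Box<dyn StatisticsProvider>>,
}

impl StatisticsRegistry {
    fn new() -> Self {
        Self { providers: Vec::new() }
    }

    fn register(&mut self, provider: Box<dyn StatisticsProvider>) {
        self.providers.push(provider);
    }

    /// Returns the first `Computed` estimate, or `None` if every provider delegates.
    fn row_count(&self, node: &PlanNode) -> Option<u64> {
        self.providers.iter().find_map(|p| match p.estimate_rows(node, self) {
            StatisticsResult::Computed(rows) => Some(rows),
            StatisticsResult::Delegate => None,
        })
    }
}

/// Default handling for built-in operators.
struct DefaultProvider;

impl StatisticsProvider for DefaultProvider {
    fn estimate_rows(&self, node: &PlanNode, registry: &StatisticsRegistry) -> StatisticsResult {
        match node {
            PlanNode::Scan { rows } => StatisticsResult::Computed(*rows),
            PlanNode::Filter { input } => match registry.row_count(input) {
                // Assumed placeholder selectivity of 1/10.
                Some(rows) => StatisticsResult::Computed(rows / 10),
                None => StatisticsResult::Delegate,
            },
            PlanNode::Custom { .. } => StatisticsResult::Delegate,
        }
    }
}

/// User-supplied logic for a custom node that keeps half of its input.
struct SampleHalfProvider;

impl StatisticsProvider for SampleHalfProvider {
    fn estimate_rows(&self, node: &PlanNode, registry: &StatisticsRegistry) -> StatisticsResult {
        match node {
            PlanNode::Custom { name, input } if name.as_str() == "sample_half" => {
                match registry.row_count(input) {
                    Some(rows) => StatisticsResult::Computed(rows / 2),
                    None => StatisticsResult::Delegate,
                }
            }
            _ => StatisticsResult::Delegate,
        }
    }
}
```

   With `SampleHalfProvider` registered ahead of `DefaultProvider`, a 
`sample_half` node over a filter over a 1000-row scan resolves through both 
providers; with only `DefaultProvider` registered, the custom node yields no 
estimate.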
   
   StatisticsV2 and the Distribution-based statistics are advanced and 
interesting, but I see them more as an extension via `ExtendedStatistics` than 
as a replacement for the default implementation. Your observation about NDVs, 
for instance, is correct: that work deals with distributions, but as-is it 
can't answer questions about the number of distinct values.
   
   If this sounds interesting and aligns with community interest, I can provide 
a more detailed design doc and an epic to break down the work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


