Re: [PR] feat: Extract NDV (distinct_count) statistics from Parquet metadata [datafusion]

via GitHub Mon, 02 Mar 2026 07:17:02 -0800


asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2873011310



##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
                     }
                 }
             } else {
-                // TODO stats: estimate more statistics from expressions
-                // (expressions should compute their statistics themselves)
-                ColumnStatistics::new_unknown()
+                // TODO: expressions should compute their own statistics

Review Comment:
   Sorry for the late reply @adriangb, I have been off for a little while.
   
   I have been following https://github.com/apache/datafusion/pull/19609 with 
lots of interest. As I understand it, that PR deals with statistics for 
filtering and predicate pruning, so the two approaches are orthogonal 
and complement each other.
   
   This PR focuses on NDV, laying the foundation for improved cardinality 
estimation (which we will soon be able to measure precisely thanks to 
https://github.com/apache/datafusion/pull/20292).
   
   For DataFusion, the benefit is that, when statistics are available, some 
existing configuration options could be replaced by data-driven estimates. A 
few examples for NDV:
   - [prefer_hash_join](https://github.com/apache/datafusion/blob/1f37a33ce530bdedcaf3aba65295703874cd7d09/datafusion/common/src/config.rs#L1073): 
with NDV the optimizer could compare the cost of `HashJoin` vs `SortMergeJoin`, 
especially when downstream ordering is needed, enabling per-join cost-based 
decisions
   - [default_filter_selectivity](https://github.com/apache/datafusion/blob/1f37a33ce530bdedcaf3aba65295703874cd7d09/datafusion/common/src/config.rs#L1121) 
(20% hardcoded value): with NDV, the selectivity of an equality filter becomes 
`1/NDV(col)`, that of an IN-list becomes `list_size/NDV(col)`, etc.
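   
   As a rough illustration of the selectivity formulas above, here is a minimal 
sketch. It is not code from this PR and does not use DataFusion's API: the 
function names, the `Option<u64>` representation of NDV, and the cap at 1.0 
are all assumptions made for the example. The fallback constant only mirrors 
the 20% default mentioned above.

```rust
/// Hypothetical fallback, mirroring the 20% default selectivity mentioned above.
const DEFAULT_FILTER_SELECTIVITY: f64 = 0.20;

/// Estimated selectivity of `col = literal`.
/// Uses 1/NDV(col) when NDV is known and non-zero, else the fallback.
fn equality_selectivity(ndv: Option<u64>) -> f64 {
    match ndv {
        Some(n) if n > 0 => 1.0 / n as f64,
        _ => DEFAULT_FILTER_SELECTIVITY,
    }
}

/// Estimated selectivity of `col IN (v1, ..., vk)`.
/// Uses k/NDV(col), capped at 1.0 (an assumption of this sketch, so that a
/// list longer than NDV never yields a selectivity above 100%).
fn in_list_selectivity(list_size: usize, ndv: Option<u64>) -> f64 {
    match ndv {
        Some(n) if n > 0 => (list_size as f64 / n as f64).min(1.0),
        _ => DEFAULT_FILTER_SELECTIVITY,
    }
}

/// Estimated output row count after applying a filter with the given selectivity.
fn estimated_rows(input_rows: u64, selectivity: f64) -> u64 {
    (input_rows as f64 * selectivity).round() as u64
}

fn main() {
    let input_rows = 1_000_000;
    // Column with 100 distinct values vs. a column without NDV statistics.
    let with_ndv = estimated_rows(input_rows, equality_selectivity(Some(100)));
    let without_ndv = estimated_rows(input_rows, equality_selectivity(None));
    println!("equality filter, NDV known:   {with_ndv} rows");
    println!("equality filter, NDV unknown: {without_ndv} rows");
}
```

   Both formulas implicitly assume values are uniformly distributed across the 
distinct values, which is the usual simplification when only NDV is available.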
   
   This is also relevant for distributed DataFusion, where cardinality 
estimation plays an even larger role in physical planning (many planning 
choices can't be corrected via adaptive query processing in a distributed 
setup).
   
   The PR has gone stale, so I will need to rebase it on the current main 
branch so that interested reviewers can take a look.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


