Re: [PR] feat: Extract NDV (distinct_count) statistics from Parquet metadata [datafusion]

via GitHub Mon, 26 Jan 2026 01:58:32 -0800


gabotechs commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2726962898



##########
datafusion/common/src/stats.rs:
##########
@@ -632,7 +632,14 @@ impl Statistics {
             col_stats.max_value = 
col_stats.max_value.max(&item_col_stats.max_value);
             col_stats.min_value = 
col_stats.min_value.min(&item_col_stats.min_value);
             col_stats.sum_value = 
col_stats.sum_value.add(&item_col_stats.sum_value);
-            col_stats.distinct_count = Precision::Absent;
+            // Use max as a conservative lower bound for distinct count
+            // (can't accurately merge NDV since duplicates may exist across 
partitions)
+            col_stats.distinct_count = col_stats
+                .distinct_count
+                .get_value()
+                .max(item_col_stats.distinct_count.get_value())
+                .map(|&v| Precision::Inexact(v))
+                .unwrap_or(Precision::Absent);

Review Comment:
   👍 thinks sounds pretty reasonable to me



##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
                     }
                 }
             } else {
-                // TODO stats: estimate more statistics from expressions
-                // (expressions should compute their statistics themselves)
-                ColumnStatistics::new_unknown()
+                // TODO: expressions should compute their own statistics

Review Comment:
   I think this TODO comment is right on point: if we wanted to compute 
statistics for all expressions here, it would become unmanageable pretty soon.
   
   Even if the estimation about expressions referencing single columns here is 
an improvement, it would be awesome if we could try to be one step ahead and 
think how proper stats propagation across expressions would look like. That 
way, this PR, instead of special-handling one particular case, it could be 
setting the foundation on top of which more people can start contributing 
smarter stats calculation.
   
   I see that some prior work has been done in order to improve stats 
propagation in expressions. Some context here:
   - https://github.com/apache/datafusion/pull/14699
   - https://github.com/apache/datafusion/issues/14896
   
   By looking at the API for handling statistics in expressions shipped in 
https://github.com/apache/datafusion/pull/14699... I would not know how to use 
it in order to properly calculate NDV values. I get the impression that 1) we 
do not want to introduce yet another API for propagating stats across 
expressions and 2) the current "new" API shipped in 
https://github.com/apache/datafusion/pull/14699 is not suitable for propagating 
NDV values.
   
   



##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -411,53 +415,6 @@ fn create_max_min_accs(
     (max_values, min_values)
 }
 
-fn get_col_stats(

Review Comment:
   Correct me if I'm wrong but the `get_col_stats()` changes look more like a 
noop refactor right? I think it looks good 👍 but just double checking that 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: Extract NDV (distinct_count) statistics from Parquet metadata [datafusion]

Reply via email to