Re: [PR] feat: Extract NDV (distinct_count) statistics from Parquet metadata [datafusion]

via GitHub Thu, 12 Mar 2026 02:59:14 -0700


gene-bordegaray commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2923487862



##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -541,6 +541,36 @@ fn summarize_min_max_null_counts(
     )
     .map(|(idx, _)| idx);
 
+    // Extract distinct counts from row group column statistics
+    accumulators.distinct_counts_array[logical_schema_index] =
+        if let Some(parquet_idx) = parquet_index {
+            let distinct_counts: Vec<u64> = row_groups_metadata
+                .iter()
+                .filter_map(|rg| {
+                    rg.columns()
+                        .get(parquet_idx)
+                        .and_then(|col| col.statistics())
+                        .and_then(|stats| stats.distinct_count_opt())
+                })
+                .collect();
+
+            if distinct_counts.is_empty() {
+                Precision::Absent
+            } else if distinct_counts.len() == 1 {

Review Comment:
   I think the partial-coverage case still needs tightening.
    
   If we merge multiple row groups and only some of them expose NDV returning 
NDV as Exact for the whole file doesn't seem accurrate to me. Could we return 
make it Inexact? 
   
   The test `test_distinct_count_partial_ndv_in_row_groups` seems wrong 
asserting `Exact(15)` for a file where another row group has unknown NDV.
   



##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
             col_stats.max_value = 
col_stats.max_value.max(&item_col_stats.max_value);
             col_stats.min_value = 
col_stats.min_value.min(&item_col_stats.min_value);
             col_stats.sum_value = 
col_stats.sum_value.add(&item_col_stats.sum_value);
-            col_stats.distinct_count = Precision::Absent;
+            // Use max as a conservative lower bound for distinct count
+            // (can't accurately merge NDV since duplicates may exist across 
partitions)

Review Comment:
   If we keep max() as the merge rule, I think we should document that merged 
NDV is treated as a lower-bound-ish estimate in public facing comments.
   



##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
                     }
                 }
             } else {
-                // TODO stats: estimate more statistics from expressions
-                // (expressions should compute their statistics themselves)
-                ColumnStatistics::new_unknown()
+                // TODO: expressions should compute their own statistics

Review Comment:
   Following up on this thread for my review:
   
   I’d prefer to drop the new NDV propagation in projections from this PR. The 
heuristic “expression references a single column” is too broad to be correct 
for NDV. The thread already suggests handling this in a separate expression 
statistics design. For this PR I think derived expressions should stay unknown, 
then address the rest in a follow-up



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: Extract NDV (distinct_count) statistics from Parquet metadata [datafusion]

Reply via email to