Re: [PR] Estimate aggregate output rows using existing NDV statistics [datafusion]

via GitHub Sat, 04 Apr 2026 17:43:22 -0700


2010YOUY01 commented on code in PR #20926:
URL: https://github.com/apache/datafusion/pull/20926#discussion_r3036219189



##########
datafusion/physical-plan/src/aggregates/mod.rs:
##########
@@ -1037,6 +1037,34 @@ impl AggregateExec {
         &self.input_order_mode
     }
 
+    /// Estimates output statistics for this aggregate node.
+    ///
+    /// For grouped aggregations with known input row count > 1, the output row
+    /// count is estimated as:
+    ///
+    /// ```text
+    /// ndv        = sum over each grouping set of product(max(NDV_i + 
nulls_i, 1))
+    /// output_rows = input_rows                       // baseline
+    /// output_rows = min(output_rows, ndv)             // if NDV available
+    /// output_rows = min(output_rows, limit)           // if TopK active
+    /// ```

Review Comment:
   It would be great to include one or two numeric examples here.



##########
datafusion/physical-plan/src/aggregates/mod.rs:
##########
@@ -1113,6 +1163,34 @@ impl AggregateExec {
         }
     }
 
+    /// Computes the estimated number of distinct groups across all grouping 
sets.
+    /// For each grouping set, computes `product(NDV_i + null_adj_i)` for 
active columns,
+    /// then sums across all sets. Returns `None` if any active column is not 
a direct
+    /// column reference or lacks `distinct_count` stats.
+    /// When `null_count` is absent or unknown, null_adjustment defaults to 0.
+    fn compute_group_ndv(&self, child_statistics: &Statistics) -> 
Option<usize> {

Review Comment:
   One minor code structure suggestion, totally optionally.
   
   Now this function only handles ndv estimation, and the fallback logic and 
limit handling (if returns None, uses num_rows instead) is handled at the 
caller. The issue is the similar logic gets scattered to multiple places, and 
this might bring maintenance overhead.
   An alternative I think can make it organize nicer: put the fallback logic 
also inside this function (adding num_rows, limit to arg, and directly return 
`usize`)



##########
datafusion/physical-plan/src/aggregates/mod.rs:
##########
@@ -1113,6 +1163,34 @@ impl AggregateExec {
         }
     }
 
+    /// Computes the estimated number of distinct groups across all grouping 
sets.
+    /// For each grouping set, computes `product(NDV_i + null_adj_i)` for 
active columns,
+    /// then sums across all sets. Returns `None` if any active column is not 
a direct
+    /// column reference or lacks `distinct_count` stats.
+    /// When `null_count` is absent or unknown, null_adjustment defaults to 0.
+    fn compute_group_ndv(&self, child_statistics: &Statistics) -> 
Option<usize> {
+        let mut total: usize = 0;
+        for group_mask in &self.group_by.groups {
+            let mut set_product: usize = 1;
+            for (j, (expr, _)) in self.group_by.expr.iter().enumerate() {
+                if group_mask[j] {
+                    continue;
+                }
+                let col = expr.as_any().downcast_ref::<Column>()?;

Review Comment:
   Just to ensure I understand it correctly: If group by expression is 
non-column (e.g. `abs(a)`), this functions returns None, and its because stats 
on expressions are not supported yet, and its WIP via 
https://github.com/apache/datafusion/pull/21122
   I think this is also worth mentioning in the comment somewhere.



##########
datafusion/physical-plan/src/aggregates/mod.rs:
##########
@@ -1113,6 +1163,34 @@ impl AggregateExec {
         }
     }
 
+    /// Computes the estimated number of distinct groups across all grouping 
sets.
+    /// For each grouping set, computes `product(NDV_i + null_adj_i)` for 
active columns,
+    /// then sums across all sets. Returns `None` if any active column is not 
a direct
+    /// column reference or lacks `distinct_count` stats.
+    /// When `null_count` is absent or unknown, null_adjustment defaults to 0.

Review Comment:
   It would be great to add some simple numeric examples for multiple group 
keys, to demonstrate how is NDV calculated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Estimate aggregate output rows using existing NDV statistics [datafusion]

Reply via email to