Re: [PR] implement `AggregateExec.partition_statistics` [datafusion]

via GitHub Fri, 09 May 2025 00:42:37 -0700


UBarney commented on code in PR #15954:
URL: https://github.com/apache/datafusion/pull/15954#discussion_r2074906893



##########
datafusion/physical-plan/src/aggregates/mod.rs:
##########
@@ -751,28 +771,16 @@ impl AggregateExec {
                 })
             }
             _ => {
-                // When the input row count is 0 or 1, we can adopt that statistic keeping its reliability.
+                // When the input row count is 1, we can adopt that statistic keeping its reliability.
                 // When it is larger than 1, we degrade the precision since it may decrease after aggregation.
-                let num_rows = if let Some(value) = self
-                    .input()
-                    .partition_statistics(None)?
-                    .num_rows
-                    .get_value()
+                let num_rows = if let Some(value) = child_statistics.num_rows.get_value()
                 {
-                    if *value > 1 {
-                        self.input()
-                            .partition_statistics(None)?
-                            .num_rows
-                            .to_inexact()
-                    } else if *value == 0 {
-                        // Aggregation on an empty table creates a null row.

Review Comment:
   *   If `!group_by_expr.is_empty()` and `input_statistics.num_rows == 0`:
       *   Both `Partial` and `Final` aggregation modes (`agg.mode`) yield 0 output rows. (Note the AggregateExec metric: `[output_rows=0]`.)
   ```
   > explain analyze select count(*) from generate_series(0) where value > 10 group by value;

   plan_type: Plan with Metrics
   plan:
   ProjectionExec: expr=[count(Int64(1))@1 as count(*)], metrics=[output_rows=0, elapsed_compute=24ns]
     AggregateExec: mode=FinalPartitioned, gby=[value@0 as value], aggr=[count(Int64(1))], metrics=[output_rows=0, elapsed_compute=100.016µs, spill_count=0, spilled_bytes=0, spilled_rows=0, peak_mem_used=1536]
       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=452ns]
         RepartitionExec: partitioning=Hash([value@0], 24), input_partitions=24, metrics=[fetch_time=10.544607ms, repartition_time=24ns, send_time=576ns]
           AggregateExec: mode=Partial, gby=[value@0 as value], aggr=[count(Int64(1))], metrics=[output_rows=0, elapsed_compute=170.537µs, spill_count=0, spilled_bytes=0, spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=1536]
             CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=663ns]
               FilterExec: value@0 > 10, metrics=[output_rows=0, elapsed_compute=2.201314ms]
                 RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1, metrics=[fetch_time=3.077µs, repartition_time=1ns, send_time=1.16µs]
                   LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=0, batch_size=8192], metrics=[]

   1 row(s) fetched.
   Elapsed 0.004 seconds.
   
   > select count(*) from generate_series(0) where value > 10 group by value;
   +----------+
   | count(*) |
   +----------+
   +----------+
   0 row(s) fetched.
   ```
   *   If `group_by_expr.is_empty()` and `input_statistics.num_rows == 0`:
       *   `Final` aggregation mode (`agg.mode == Final`) yields 1 output row, but that case has already returned earlier and never reaches this line.
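
The two cases above can be summarized in a small sketch. This is not the DataFusion implementation: `RowCount` and `grouped_output_rows` are hypothetical names that stand in for the row-count statistic and for the `_ =>` arm under discussion, and the sketch assumes a non-empty `GROUP BY` (the empty `GROUP BY` case is assumed to have returned earlier).

```rust
/// Hypothetical stand-in for a row-count statistic with a reliability tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowCount {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Sketch of the output row-count estimate for an aggregation with a
/// non-empty GROUP BY, given the input row count.
fn grouped_output_rows(input: RowCount) -> RowCount {
    match input {
        // More than one input row: grouping may merge rows, so the input
        // count is only an upper bound and loses its exactness.
        RowCount::Exact(n) if n > 1 => RowCount::Inexact(n),
        // Zero input rows give zero groups and one input row gives one
        // group, so the statistic is adopted unchanged. Inexact and absent
        // inputs also pass through as they are.
        other => other,
    }
}
```

Under these assumptions, an exact input count of 0 stays an exact 0, which matches the `output_rows=0` metric in the plan above, while an exact count of 10 becomes an inexact 10.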


