Re: [PR] Fix bug in optimizing a nested count [arrow-datafusion]

via GitHub Thu, 07 Dec 2023 11:53:37 -0800


Dandandan commented on code in PR #8459:
URL: https://github.com/apache/arrow-datafusion/pull/8459#discussion_r1419533541



##########
datafusion/optimizer/src/optimize_projections.rs:
##########
@@ -213,6 +213,16 @@ fn optimize_projections(
             let (aggregate_input, _is_added) =
                 add_projection_on_top_if_helpful(aggregate_input, necessary_exprs, true)?;
 
+            // Aggregate always needs at least one aggregate expression.
+            // With a nested count we don't require any column as input, but still need to create a correct aggregate
+            // The aggregate may be optimized out later (select count(*) from (select count(*) from [...]) always returns 1
+            if new_aggr_expr.is_empty()
+                && new_group_bys.is_empty()
+                && !aggregate.aggr_expr.is_empty()
+            {
+                new_aggr_expr = vec![aggregate.aggr_expr[0].clone()];

Review Comment:
   If the parent query requires no column, it is only interested in the number of rows (or whether any rows exist), so it doesn't matter which aggregate expression we keep, even when there are several.
   For example, in `select count(*) from (select count(*) a, count(*) b from (select 1));` keeping either `a` or `b` gives the same result, and the whole aggregate is in fact optimized away later by `AggregateStatistics`:
   
   ```
   | physical_plan after aggregate_statistics | OutputRequirementExec                  |
   |                                          |   ProjectionExec: expr=[1 as COUNT(*)] |
   |                                          |     EmptyExec: produce_one_row=true    |
   ```
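
   To make the argument concrete, here is a rough, standalone sketch of the guard from the diff. It is a toy model, not DataFusion's API: expressions are plain strings, and `prune_aggr_exprs` and its `required` parameter are hypothetical names introduced only for illustration.

   ```rust
   // Toy model of the guard in the diff; NOT DataFusion's real types.
   // `aggr_expr`: the aggregate's expressions, `group_by`: its grouping keys,
   // `required`: indices of the aggregate expressions the parent plan uses
   // (all three are assumptions made for this sketch).
   fn prune_aggr_exprs(
       aggr_expr: &[String],
       group_by: &[String],
       required: &[usize],
   ) -> Vec<String> {
       // Keep only the aggregate expressions the parent actually refers to.
       let mut new_aggr_expr: Vec<String> = required
           .iter()
           .filter_map(|&i| aggr_expr.get(i).cloned())
           .collect();

       // Same condition as in the diff: nothing is required and there is no
       // grouping, so the parent only cares about the row count. Keep the
       // first expression so the aggregate stays valid; which one is kept
       // does not change the number of rows produced.
       if new_aggr_expr.is_empty() && group_by.is_empty() && !aggr_expr.is_empty() {
           new_aggr_expr = vec![aggr_expr[0].clone()];
       }
       new_aggr_expr
   }
   ```

   Under these assumptions, calling it with `["COUNT(*) AS a", "COUNT(*) AS b"]`, no group-by keys and no required indices keeps just the first expression, which matches the nested count case above.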



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
