phillipleblanc opened a new issue, #16126:
URL: https://github.com/apache/datafusion/issues/16126

   ### Describe the bug
   
   DataFusion turns aggregation computations from a LogicalPlan node into 
column references in the higher level plans. To illustrate, consider the 
following plan:
   
   ```bash
   Projection: item.i_category, count(*)                                        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        
    Sort: count(Int64(1)) AS count(*) AS count(*) ASC NULLS LAST                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        
      Projection: item.i_category, count(Int64(1)) AS count(*), count(Int64(1)) 
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        
        Aggregate: groupBy=[[item.i_category]], aggr=[[count(Int64(1))]]        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        
          TableScan: item projection=[i_category] 
   ```
   
   The Projection node immediately above the Aggregate is projecting both 
`count(Int64(1)) AS count(*) AS count(*)` and `count(Int64(1))`. The reference 
to `count(Int64(1))` in both expressions is denoted by a Column reference with 
the name literally set to `count(Int64(1))`. This is because from DataFusion's 
point of view, once it computes the aggregation, it no longer cares about how 
that column came to be. So all of the layers above only need to refer to the 
name that DataFusion assigned that column.
   
   This presents a problem to the unparser, because we're tasked with 
translating this logical plan back into a SQL statement that can be executed.
   
   The DataFusion unparser currently handles this by looking for any aggregate 
nodes in the LogicalPlan, and using the `aggr` references to find the 
underlying expression that computed the column with a name like 
`count(Int64(1))`. Once we have the underlying expression, we can send that to 
the expression unparser to get the correct SQL that produced it.
   
   This approach works well, but in the handling for the `ORDER BY` clause 
specifically, DataFusion wasn't being general enough. It was expecting either a 
literal column reference (or an aliased column reference) and then invoking the 
aggregation unprojection logic outlined above. But `ORDER BY` can have 
arbitrary expressions. The above double aliasing is one example - its a bit 
strange, but logically its a valid expression.
   
   Another example is from TPCDS Q36 (simplified to just show the relevant 
part):
   
   ```sql
   select
     i_category,
     i_class,
     grouping(i_category) + grouping(i_class) as lochierarchy
   from
     store_sales,
     item
   group by
     rollup (i_category, i_class)
   order by
     grouping(i_category) + grouping(i_class) desc,
     case
       when grouping(i_category) + grouping(i_class) = 0 then i_category
     end
   LIMIT
     100;
   ```
   
   This example has both a binary expression (`grouping(i_category) + 
grouping(i_class) desc`) and a case statement `case when grouping(i_category) + 
grouping(i_class) = 0 then i_category end`.
   
   These are both valid SQL expressions for ORDER BY clauses, and the column 
references were embedded inside other expressions. But the previous unparsing 
logic for sorts was only expecting simple column/single aliased column 
references.
   
   The reason this worked previously was because DataFusion itself had a bug 
that didn't handle arbitrary order by expressions - so it was a bit of a happy 
coincidence that it worked in the unparser. That bug was fixed in 
https://github.com/apache/datafusion/pull/14486 and allowed DataFusion itself 
to handle arbitrary expressions in the ORDER BY clause. The reason it was a 
happy coincidence is because the bug in DataFusion was not triggered when the 
LogicalPlan was created, only when it was executed. But for the unparser 
usecase, we weren't executing it in DataFusion - we were just using it to 
unparse the LogicalPlan back to SQL.
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to