peter-toth opened a new pull request #31913:
URL: https://github.com/apache/spark/pull/31913


   ### What changes were proposed in this pull request?
   The `CollapseProject` rule can collapse a `Project` over an `Aggregate` node 
into a single `Aggregate` node. If the expressions in the `Project` and 
`Aggregate` nodes are complex, further optimizations can then be applied to the 
collapsed expressions. As a result, an expression without an aggregate function 
in the `Aggregate` node may no longer reference any grouping expression.
   
   Here is a simple example:
   ```
   SELECT not(id)
   FROM (
     SELECT t.id IS NULL AS id
     FROM t
     GROUP BY t.id IS NULL
   ) t
   ```
   In this case the `CollapseProject` does this:
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
   !Project [NOT id#224 AS (NOT id)#227, c#225L]                                    Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT id)#227, count(1) AS c#225L]
   !+- Aggregate [isnull(id#222)], [isnull(id#222) AS id#224, count(1) AS c#225L]   +- Project [value#219 AS id#222]
   !   +- Project [value#219 AS id#222]                                                +- LocalRelation [value#219]
   !      +- LocalRelation [value#219]
   ```
   and then `NOT isnull(id#222)` is optimized to `isnotnull(id#222)`, which no 
longer refers to any grouping expression:
   ```
   === Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
   !Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT id)#227, count(1) AS c#225L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT id)#227, count(1) AS c#225L]
    +- Project [value#219 AS id#222]                                                       +- Project [value#219 AS id#222]
       +- LocalRelation [value#219]                                                           +- LocalRelation [value#219]
   ```
   
   This PR introduces `GroupingExpression` as a simple wrapper around complex 
grouping expressions to prevent further optimizations on them.
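
The effect of such a wrapper can be sketched with a toy expression tree. This is only an illustration in Python, not Catalyst code: the node classes, `simplify`, and `contains` are hypothetical stand-ins, and the assumption that the simplifier never looks inside the wrapper is inferred from the description above rather than taken from the actual implementation.

```python
from dataclasses import dataclass


# Toy expression nodes; illustrative stand-ins, not Catalyst classes.
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class IsNull:
    child: object


@dataclass(frozen=True)
class IsNotNull:
    child: object


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class GroupingExpression:
    child: object  # assumed opaque: the simplifier never looks inside


def simplify(expr):
    """Rewrite NOT isnull(x) to isnotnull(x), recursing bottom-up."""
    if isinstance(expr, Not):
        child = simplify(expr.child)
        if isinstance(child, IsNull):
            return IsNotNull(child.child)
        return Not(child)
    if isinstance(expr, (IsNull, IsNotNull)):
        return type(expr)(simplify(expr.child))
    # Attr and GroupingExpression are returned untouched.
    return expr


def contains(expr, target):
    """True if `target` occurs as a subtree of `expr`."""
    if expr == target:
        return True
    child = getattr(expr, "child", None)
    return child is not None and contains(child, target)


grouping = IsNull(Attr("id"))

# Without the wrapper the grouping expression is rewritten away.
unwrapped = simplify(Not(grouping))
# With the wrapper the grouping expression survives as a subtree.
wrapped = simplify(Not(GroupingExpression(grouping)))
```

In this sketch `contains(unwrapped, grouping)` is false while `contains(wrapped, grouping)` is true, which mirrors the difference between the two plans shown in this description.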
   
   Before this PR:
   ```
   (5) HashAggregate
   Input [2]: [isnull(id#222)#230, count#232L]
   Keys [1]: [isnull(id#222)#230]
   Functions [1]: [count(1)]
   Aggregate Attributes [1]: [count(1)#226L]
   Results [2]: [isnotnull(id#222) AS (NOT id)#227, count(1)#226L AS c#225L]
   ```
   and running the query throws an error:
   ```
   Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
   java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
   ```
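
Why the lookup fails can be illustrated with a simplified model of reference binding. This is a hedged Python sketch, not Spark's binding code: `bind`, `BoundRef`, and the node classes are hypothetical, and a `LookupError` stands in for the `IllegalStateException` above. The assumption is that the aggregate's input exposes only the grouping key and the aggregate buffer as columns, so a bare `id#222` has nothing to bind to.

```python
from dataclasses import dataclass


# Toy expression nodes; illustrative stand-ins, not Catalyst classes.
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class IsNull:
    child: object


@dataclass(frozen=True)
class IsNotNull:
    child: object


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class BoundRef:
    ordinal: int  # position of the column in the operator's input


def bind(expr, grouping_exprs, input_attrs):
    """Replace grouping expressions and attributes with positional references."""
    if expr in grouping_exprs:
        # A whole grouping expression maps to a single input column.
        return BoundRef(input_attrs.index(grouping_exprs[expr]))
    if isinstance(expr, Attr):
        if expr.name not in input_attrs:
            raise LookupError(f"Couldn't find {expr.name} in {input_attrs}")
        return BoundRef(input_attrs.index(expr.name))
    return type(expr)(bind(expr.child, grouping_exprs, input_attrs))


grouping = IsNull(Attr("id#222"))
keys = {grouping: "isnull(id#222)#230"}
inputs = ["isnull(id#222)#230", "count(1)#226L"]

# NOT isnull(id#222) still contains the grouping expression, so it binds.
bound_ok = bind(Not(grouping), keys, inputs)

# isnotnull(id#222) only contains the raw attribute, which is not an input column.
try:
    bind(IsNotNull(Attr("id#222")), keys, inputs)
    error = None
except LookupError as exc:
    error = str(exc)
```

Under these assumptions the first call yields `Not(BoundRef(0))`, while the second raises because `id#222` is not among the input columns.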
   
   After this PR:
   ```
   (5) HashAggregate
   Input [2]: [isnull(id#222)#242, count#238L]
   Keys [1]: [isnull(id#222)#242]
   Functions [1]: [count(1)]
   Aggregate Attributes [1]: [count(1)#226L]
   Results [2]: [NOT groupingexpression(isnull(id#222)#242) AS (NOT id)#227, count(1)#226L AS c#225L]
   ```
   and the query works.
   
   
   ### Why are the changes needed?
   To keep grouping expressions intact inside aggregate expressions wherever an 
aggregate expression must refer to a grouping expression.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the example query now runs instead of throwing an 
`IllegalStateException`.
   
   
   ### How was this patch tested?
   Added a new unit test.
   

