peter-toth opened a new pull request #32396:
URL: https://github.com/apache/spark/pull/32396


   ### What changes were proposed in this pull request?
   This PR adds a new rule `PullOutGroupingExpressions` to pull out complex 
grouping expressions to a `Project` node under an `Aggregate`. These 
expressions are then referenced in both grouping expressions and aggregate 
expressions without aggregate functions to ensure that optimization rules don't 
change the aggregate expressions to invalid ones that no longer refer to any 
grouping expressions.
   
   ### Why are the changes needed?
   If aggregate expressions (without aggregate functions) in an `Aggregate` 
node are complex then the `Optimizer` can optimize out grouping expressions 
from them and so making aggregate expressions invalid.
   
   Here is a simple example:
   ```
   SELECT not(t.id IS NULL) , count(*)
   FROM t
   GROUP BY t.id IS NULL
   ```
   In this case the `BooleanSimplification` rule does this:
   ```
   === Applying Rule 
org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
   !Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, 
count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT 
(id IS NULL))#226, count(1) AS c#224L]
    +- Project [value#219 AS id#222]                                            
                     +- Project [value#219 AS id#222]
       +- LocalRelation [value#219]                                             
                        +- LocalRelation [value#219]                            
              
   ```
   where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no 
longer refers to any grouping expression.
   
   Before this PR:
   ```
   == Optimized Logical Plan ==
   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, 
count(1) AS c#232L]
   +- Project [value#219 AS id#222]
      +- LocalRelation [value#219]
   ```
   and running the query throws an error:
   ```
   Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
   java.lang.IllegalStateException: Couldn't find id#222 in 
[isnull(id#222)#230,count(1)#226L]
   ```
   
   After this PR:
   ```
   == Optimized Logical Plan ==
   Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT 
(id IS NULL))#230, count(1) AS c#228L]
   +- Project [isnull(value#219) AS _groupingexpression#233]
      +- LocalRelation [value#219]
   ```
   and the query works.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, the query works.
   
   
   ### How was this patch tested?
   Added new UT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to