peter-toth opened a new pull request #31913:
URL: https://github.com/apache/spark/pull/31913
### What changes were proposed in this pull request?
The `CollapseProject` rule can collapse a `Project` over an `Aggregate` node
into a single `Aggregate` node. If the expressions in the `Project` and
`Aggregate` nodes are complex, further optimizations can then be applied to the
collapsed expressions. As a result, an expression without an aggregate function
in the `Aggregate` node may no longer reference any grouping expression.
Here is a simple example:
```
SELECT not(id)
FROM (
  SELECT t.id IS NULL AS id
  FROM t
  GROUP BY t.id IS NULL
) t
```
In this case the `CollapseProject` does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [NOT id#224 AS (NOT id)#227, c#225L]                                    Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT id)#227, count(1) AS c#225L]
!+- Aggregate [isnull(id#222)], [isnull(id#222) AS id#224, count(1) AS c#225L]   +- Project [value#219 AS id#222]
!   +- Project [value#219 AS id#222]                                                +- LocalRelation [value#219]
!      +- LocalRelation [value#219]
```
and then `NOT isnull(id#222)` is optimized to `isnotnull(id#222)`, so it no
longer refers to any grouping expression:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT id)#227, count(1) AS c#225L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT id)#227, count(1) AS c#225L]
 +- Project [value#219 AS id#222]                                                       +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                           +- LocalRelation [value#219]
```
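The two rewrites above can be modelled with a toy expression tree. This is only an illustrative sketch in plain Python, not Catalyst code: the classes `Attr`, `IsNull`, `IsNotNull`, `Not` and the helpers `simplify` and `covered_by_grouping` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class IsNull:
    child: object


@dataclass(frozen=True)
class IsNotNull:
    child: object


@dataclass(frozen=True)
class Not:
    child: object


def simplify(expr):
    """Toy stand-in for the rewrite NOT isnull(x) -> isnotnull(x)."""
    if isinstance(expr, Not) and isinstance(expr.child, IsNull):
        return IsNotNull(expr.child.child)
    return expr


def covered_by_grouping(expr, grouping):
    """True if every attribute in expr is reached only through a grouping expression."""
    if expr in grouping:
        return True
    if isinstance(expr, Attr):
        return False
    return covered_by_grouping(expr.child, grouping)


grouping = [IsNull(Attr("id"))]
collapsed = Not(IsNull(Attr("id")))  # aggregate output after the collapse
simplified = simplify(collapsed)     # aggregate output after the simplification

print(covered_by_grouping(collapsed, grouping))   # still built on the grouping expression
print(covered_by_grouping(simplified, grouping))  # now refers to the bare attribute
```

In this model the collapsed expression is still composed of the grouping expression `isnull(id)`, while the simplified one reaches the raw attribute `id` directly, which is the situation the plans above show.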
This PR introduces `GroupingExpression` as a simple wrapper around complex
grouping expressions to prevent further optimizations on them.
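The idea of a wrapper that shields a grouping expression from rewrites can be sketched with a toy model. This is an assumption-laden illustration in plain Python, not the PR's Scala implementation: the `GroupingExpression` class here only borrows the name, and `wrap_grouping` and `simplify` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class IsNull:
    child: object


@dataclass(frozen=True)
class IsNotNull:
    child: object


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class GroupingExpression:
    """Opaque marker: rewrite rules in this toy model do not look inside it."""
    child: object


def simplify(expr):
    """Toy rewrite NOT isnull(x) -> isnotnull(x); only matches a direct IsNull child."""
    if isinstance(expr, Not) and isinstance(expr.child, IsNull):
        return IsNotNull(expr.child.child)
    return expr


def wrap_grouping(expr, grouping):
    """Wrap each occurrence of a complex (non-attribute) grouping expression."""
    if isinstance(expr, Attr):
        return expr
    if expr in grouping:
        return GroupingExpression(expr)
    return type(expr)(wrap_grouping(expr.child, grouping))


grouping = [IsNull(Attr("id"))]
wrapped = wrap_grouping(Not(IsNull(Attr("id"))), grouping)

print(wrapped)
print(simplify(wrapped) == wrapped)  # the rewrite no longer matches
```

Because the child of `Not` is now the wrapper rather than `IsNull`, the pattern in `simplify` does not fire and the grouping expression survives intact.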
Before this PR:
```
(5) HashAggregate
Input [2]: [isnull(id#222)#230, count#232L]
Keys [1]: [isnull(id#222)#230]
Functions [1]: [count(1)]
Aggregate Attributes [1]: [count(1)#226L]
Results [2]: [isnotnull(id#222) AS (NOT id)#227, count(1)#226L AS c#225L]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```
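Why the lookup fails can be illustrated with a simplified model of reference binding. This is a sketch under stated assumptions, not Spark's actual binding code: expressions are nested tuples, column references are plain strings, and `bind` is a hypothetical helper that resolves each column against the operator's input.

```python
def bind(expr, input_columns):
    """Replace column names with ordinals into input_columns; fail on unknown columns."""
    if isinstance(expr, str):  # a leaf column reference
        if expr not in input_columns:
            raise LookupError(
                f"Couldn't find {expr} in [{','.join(input_columns)}]"
            )
        return ("input", input_columns.index(expr))
    op, child = expr
    return (op, bind(child, input_columns))


# Columns available to the result expressions, as listed in the error message.
cols = ["isnull(id#222)#230", "count(1)#226L"]

# If the grouping expression is kept, it can be read from the grouping key column.
ok = ("not", "isnull(id#222)#230")
# After the simplification the result expression needs the raw attribute instead.
bad = ("isnotnull", "id#222")

print(bind(ok, cols))
try:
    bind(bad, cols)
except LookupError as e:
    print(e)
```

The raw attribute `id#222` is not among the columns the aggregate produces, so only expressions built on the grouping key column can be resolved.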
After this PR:
```
(5) HashAggregate
Input [2]: [isnull(id#222)#242, count#238L]
Keys [1]: [isnull(id#222)#242]
Functions [1]: [count(1)]
Aggregate Attributes [1]: [count(1)#226L]
Results [2]: [NOT groupingexpression(isnull(id#222)#242) AS (NOT id)#227, count(1)#226L AS c#225L]
```
and the query works.
### Why are the changes needed?
To preserve grouping expressions inside aggregate expressions in cases where an
aggregate expression must refer to a grouping expression.
### Does this PR introduce _any_ user-facing change?
Yes, the example query now runs instead of failing with an `IllegalStateException`.
### How was this patch tested?
Added a new unit test.