bersprockets opened a new pull request, #37825:
URL: https://github.com/apache/spark/pull/37825
### What changes were proposed in this pull request?
In `RewriteDistinctAggregates`, when grouping aggregate expressions by
function children, treat children that are semantically equivalent as the same.
### Why are the changes needed?
This PR will reduce the number of projections in the Expand operator when
there are multiple distinct aggregations with superficially different children.
In some cases, it will eliminate the need for an Expand operator.
Example: In the following query, the Expand operator creates 3\*n rows
(where n is the number of incoming rows) because it has a projection for each
of function children `b + 1`, `1 + b` and `c`.
```
create or replace temp view v1 as
select * from values
(1, 2, 3.0),
(1, 3, 4.0),
(2, 4, 2.5),
(2, 3, 1.0)
v1(a, b, c);
select
a,
count(distinct b + 1),
avg(distinct 1 + b) filter (where c > 0),
sum(c)
from
v1
group by a;
```
The Expand operator has three projections (each producing a row for each
incoming row):
```
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for
regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null], <== projection #2 (for
distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for
distinct aggregation of 1 + b)
```
In reality, the Expand only needs one projection for `1 + b` and `b + 1`,
because they are semantically equivalent.
With the proposed change, the Expand operator's projections look like this:
```
[a#67, null, 0, null, UnscaledValue(c#69)], <== projection #1 (for regular
aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]], <== projection #2 (for distinct
aggregation on b + 1 and 1 + b)
```
With one less projection, Expand produces 2\*n rows instead of 3\*n rows,
but still produces the correct result.
In the case where all distinct aggregates have semantically equivalent
children, the Expand operator is not needed at all.
See benchmark results in the JIRA (SPARK-40382).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]