bersprockets opened a new pull request, #37825:
URL: https://github.com/apache/spark/pull/37825

   ### What changes were proposed in this pull request?
   
   In `RewriteDistinctAggregates`, when grouping aggregate expressions by 
function children, treat children that are semantically equivalent as the same.
   
   ### Why are the changes needed?
   
   This PR will reduce the number of projections in the Expand operator when 
there are multiple distinct aggregations with superficially different children. 
In some cases, it will eliminate the need for an Expand operator.
   
   Example: In the following query, the Expand operator creates 3\*n rows 
(where n is the number of incoming rows) because it has a projection for each 
of function children `b + 1`, `1 + b` and `c`.
   
   ```
   create or replace temp view v1 as
   select * from values
   (1, 2, 3.0),
   (1, 3, 4.0),
   (2, 4, 2.5),
   (2, 3, 1.0)
   v1(a, b, c);
   
   select
     a,
     count(distinct b + 1),
     avg(distinct 1 + b) filter (where c > 0),
     sum(c)
   from
     v1
   group by a;
   ```
   The Expand operator has three projections (each producing a row for each 
incoming row):
   ```
   [a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for 
regular aggregation)
   [a#87, (b#88 + 1), null, 1, null, null],          <== projection #2 (for 
distinct aggregation of b + 1)
   [a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for 
distinct aggregation of 1 + b)
   ```
   In reality, the Expand only needs one projection for `1 + b` and `b + 1`, 
because they are semantically equivalent.
   
   With the proposed change, the Expand operator's projections look like this:
   ```
   [a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular 
aggregations)
   [a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct 
aggregation on b + 1 and 1 + b)
   ```
   With one less projection, Expand produces 2\*n rows instead of 3\*n rows, 
but still produces the correct result.
   
   In the case where all distinct aggregates have semantically equivalent 
children, the Expand operator is not needed at all.
   
   See benchmark results in the JIRA (SPARK-40382).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to