beliefer opened a new pull request, #42223:
URL: https://github.com/apache/spark/pull/42223
### What changes were proposed in this pull request?
Some queries contains multiple scalar subquery(aggregation without group by
clause) and connected with join.
For example,
```
SELECT *
FROM (SELECT
avg(power) avg_power,
count(power) count_power,
count(DISTINCT power) count_distinct_power
FROM data
WHERE country = "USA"
AND (id BETWEEN 1 AND 3
OR city = "Berkeley"
OR name = "Xiao")) B1,
(SELECT
avg(power) avg_power,
count(power) count_power,
count(DISTINCT power) count_distinct_power
FROM data
WHERE country = "China"
AND (id BETWEEN 4 AND 5
OR city = "Hangzhou"
OR name = "Wenchen")) B2
```
If we can merge the filters and aggregates, we can scan data source only
once and eliminate the join so as avoid shuffle.
This PR also supports eliminate nested Join, please refer to:
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q28.sql
Obviously, this change will improve the performance.
### Why are the changes needed?
Improve the performance for the case show above.
### Does this PR introduce _any_ user-facing change?
'No'.
New feature.
### How was this patch tested?
New test cases and micro benchmark.
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Benchmark EliminateJoinByCombineAggregate: Best Time(ms) Avg
Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------------
step is 1000000, EliminateJoinByCombineAggregate: false 601
738 119 34.9 28.6 1.0X
step is 1000000, EliminateJoinByCombineAggregate: true) 721
747 28 29.1 34.4 0.8X
step is 100000, EliminateJoinByCombineAggregate: false 355
402 68 59.0 16.9 1.7X
step is 100000, EliminateJoinByCombineAggregate: true) 214
222 10 98.0 10.2 2.8X
step is 10000, EliminateJoinByCombineAggregate: false 321
371 26 65.4 15.3 1.9X
step is 10000, EliminateJoinByCombineAggregate: true) 158
183 17 132.5 7.5 3.8X
step is 1000, EliminateJoinByCombineAggregate: false 311
348 24 67.5 14.8 1.9X
step is 1000, EliminateJoinByCombineAggregate: true) 145
161 12 144.2 6.9 4.1X
step is 100, EliminateJoinByCombineAggregate: false 305
330 27 68.7 14.6 2.0X
step is 100, EliminateJoinByCombineAggregate: true) 142
151 8 147.8 6.8 4.2X
step is 10, EliminateJoinByCombineAggregate: false 310
345 29 67.7 14.8 1.9X
step is 10, EliminateJoinByCombineAggregate: true) 139
144 6 150.4 6.7 4.3X
step is 1, EliminateJoinByCombineAggregate: false 304
318 12 69.0 14.5 2.0X
step is 1, EliminateJoinByCombineAggregate: true) 119
125 5 175.8 5.7 5.0X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]