beliefer opened a new pull request, #42223:
URL: https://github.com/apache/spark/pull/42223

   ### What changes were proposed in this pull request?
   Some queries contains multiple scalar subquery(aggregation without group by 
clause) and connected with join.
   For example,
   ```
   SELECT *
   FROM (SELECT
     avg(power) avg_power,
     count(power) count_power,
     count(DISTINCT power) count_distinct_power
   FROM data
   WHERE country = "USA"
     AND (id BETWEEN 1 AND 3
     OR city = "Berkeley"
     OR name = "Xiao")) B1,
     (SELECT
       avg(power) avg_power,
       count(power) count_power,
       count(DISTINCT power) count_distinct_power
     FROM data
     WHERE country = "China"
       AND (id BETWEEN 4 AND 5
       OR city = "Hangzhou"
       OR name = "Wenchen")) B2
   ```
   If we can merge the filters and aggregates, we can scan data source only 
once and eliminate the join so as avoid shuffle.
   
   This PR also supports eliminate nested Join, please refer to: 
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q28.sql
   
   Obviously, this change will improve the performance.
   
   
   ### Why are the changes needed?
   Improve the performance for the case show above.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature.
   
   
   ### How was this patch tested?
   New test cases and micro benchmark.
   
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   Benchmark EliminateJoinByCombineAggregate:               Best Time(ms)   Avg 
Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
---------------------------------------------------------------------------------------------------------------------------------------
   step is 1000000, EliminateJoinByCombineAggregate: false            601       
     738         119         34.9          28.6       1.0X
   step is 1000000, EliminateJoinByCombineAggregate: true)            721       
     747          28         29.1          34.4       0.8X
   step is 100000, EliminateJoinByCombineAggregate: false             355       
     402          68         59.0          16.9       1.7X
   step is 100000, EliminateJoinByCombineAggregate: true)             214       
     222          10         98.0          10.2       2.8X
   step is 10000, EliminateJoinByCombineAggregate: false              321       
     371          26         65.4          15.3       1.9X
   step is 10000, EliminateJoinByCombineAggregate: true)              158       
     183          17        132.5           7.5       3.8X
   step is 1000, EliminateJoinByCombineAggregate: false               311       
     348          24         67.5          14.8       1.9X
   step is 1000, EliminateJoinByCombineAggregate: true)               145       
     161          12        144.2           6.9       4.1X
   step is 100, EliminateJoinByCombineAggregate: false                305       
     330          27         68.7          14.6       2.0X
   step is 100, EliminateJoinByCombineAggregate: true)                142       
     151           8        147.8           6.8       4.2X
   step is 10, EliminateJoinByCombineAggregate: false                 310       
     345          29         67.7          14.8       1.9X
   step is 10, EliminateJoinByCombineAggregate: true)                 139       
     144           6        150.4           6.7       4.3X
   step is 1, EliminateJoinByCombineAggregate: false                  304       
     318          12         69.0          14.5       2.0X
   step is 1, EliminateJoinByCombineAggregate: true)                  119       
     125           5        175.8           5.7       5.0X
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to