[GitHub] [spark] peter-toth opened a new pull request #32298: [WIP][SPARK-34079][SQL] Merging non-correlated scalar subqueries to multi-column scalar subqueries for better reuse

GitBox Thu, 22 Apr 2021 07:26:53 -0700


peter-toth opened a new pull request #32298:
URL: https://github.com/apache/spark/pull/32298



   ### What changes were proposed in this pull request?
   This PR:
   - Adds a new subquery type `MultiScalarSubquery` / `MultiScalarSubqueryExec` to compute multiple scalar values at the same time.
   - Adds a new optimizer rule `MergeScalarSubqueries` that merges similar non-correlated scalar subqueries into multi-column scalar subqueries and replaces each original scalar subquery expression with `GetStructField(MultiScalarSubquery(...))`.
   - Lets the `ReuseSubquery` / `ReuseAdaptiveSubquery` rules replace multiple instances of the same `MultiScalarSubquery` with reuse references, so that a `MultiScalarSubquery` runs only once.
   
   E.g. the following query:
   ```
   SELECT
     (SELECT avg(a) FROM t GROUP BY b),
     (SELECT sum(b) FROM t GROUP BY b)
   ```
   is optimized from:
   ```
   Project [scalar-subquery#231 [] AS scalarsubquery()#241, scalar-subquery#232 [] AS scalarsubquery()#242L]
   :  :- Aggregate [b#234], [avg(a#233) AS avg(a)#236]
   :  :  +- Relation default.t[a#233,b#234] parquet
   :  +- Aggregate [b#240], [sum(b#240) AS sum(b)#238L]
   :     +- Project [b#240]
   :        +- Relation default.t[a#239,b#240] parquet
   +- OneRowRelation
   ```
   to:
   ```
   Project [multi-scalar-subquery#231.avg(a) AS scalarsubquery()#241, multi-scalar-subquery#232.sum(b) AS scalarsubquery()#242L]
   :  :- Aggregate [b#234], [avg(a#233) AS avg(a)#236, sum(b#234) AS sum(b)#238L]
   :  :  +- Project [a#233, b#234]
   :  :     +- Relation default.t[a#233,b#234] parquet
   :  +- Aggregate [b#234], [avg(a#233) AS avg(a)#236, sum(b#234) AS sum(b)#238L]
   :     +- Project [a#233, b#234]
   :        +- Relation default.t[a#233,b#234] parquet
   +- OneRowRelation
   ```
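
The rewrite shown above can be modelled with a small toy sketch. This is not the Spark implementation: the Python classes below only borrow the names `MultiScalarSubquery` and `GetStructField` from this PR, and the `ScalarSubquery` fields, the `merge_scalar_subqueries` function and the "same source and same grouping" merge condition are assumptions made for illustration. The real `MergeScalarSubqueries` rule works on logical plans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarSubquery:
    """Toy stand-in for a non-correlated scalar subquery with one aggregate."""
    source: str       # relation name, e.g. "t"
    group_by: tuple   # grouping columns, e.g. ("b",)
    aggregate: str    # the single output expression, e.g. "avg(a)"


@dataclass(frozen=True)
class MultiScalarSubquery:
    """Toy stand-in for a merged subquery that returns several columns."""
    source: str
    group_by: tuple
    aggregates: tuple


@dataclass(frozen=True)
class GetStructField:
    """Selects one column of a merged subquery by position."""
    child: MultiScalarSubquery
    ordinal: int


def merge_scalar_subqueries(subqueries):
    """Hypothetical merge: subqueries over the same source and grouping
    are combined, and each original is replaced by a field access."""
    # Pass 1: collect the distinct aggregates per (source, group_by) key.
    merged = {}
    for sq in subqueries:
        aggs = merged.setdefault((sq.source, sq.group_by), [])
        if sq.aggregate not in aggs:
            aggs.append(sq.aggregate)

    # Pass 2: rewrite every subquery into GetStructField(MultiScalarSubquery).
    rewritten = []
    for sq in subqueries:
        aggs = merged[(sq.source, sq.group_by)]
        multi = MultiScalarSubquery(sq.source, sq.group_by, tuple(aggs))
        rewritten.append(GetStructField(multi, aggs.index(sq.aggregate)))
    return rewritten


# The two subqueries from the example query above.
rewritten = merge_scalar_subqueries([
    ScalarSubquery("t", ("b",), "avg(a)"),
    ScalarSubquery("t", ("b",), "sum(b)"),
])
```

Both rewritten expressions end up pointing at an equal `MultiScalarSubquery` value, which is the property a reuse rule needs in order to execute the merged subquery only once.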
   
   ### Why are the changes needed?
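   Performance improvement. In the TPCDS q9 benchmark below, enabling the merge reduces the best time from 45892 ms to 16769 ms (2.7X):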
   ```
   TPCDS Snappy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------
   q9 - spark.sql.scalarSubqueyMerge.enabled=false          45892          47172        1220          0.0      Infinity       1.0X
   q9 - spark.sql.scalarSubqueyMerge.enabled=true           16769          16863         124          0.0      Infinity       2.7X
   ```
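
As a quick check of the `Relative` column, the ratio can be recomputed from the reported numbers. This sketch assumes the column is the baseline best time divided by each row's best time; the dictionary keys are labels chosen here for illustration.

```python
# Best Time(ms) values copied from the benchmark output above.
best_ms = {"merge disabled": 45892, "merge enabled": 16769}

# Assumption: "Relative" = baseline best time / this row's best time.
baseline = best_ms["merge disabled"]
relative = {name: round(baseline / ms, 1) for name, ms in best_ms.items()}
```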
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing UTs. I will add new ones later...
   

