[GitHub] [spark] wangyum opened a new pull request #35779: [SPARK-36194][SQL] Add a logical plan visitor to propagate the distinct attributes

GitBox Tue, 08 Mar 2022 19:13:49 -0800


wangyum opened a new pull request #35779:
URL: https://github.com/apache/spark/pull/35779



   ### What changes were proposed in this pull request?
   
   1. This PR adds a new logical plan visitor named `DistinctKeyVisitor` that finds all the distinct attributes in the current logical plan. For example:
      ```scala
      spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
      spark.sql("SELECT a, b, a % 10, a AS aliased_a, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys
      ```
      The output is: {a#1, b#2}, {b#2, aliased_a#0}.
   
   2. Enhance `RemoveRedundantAggregates` to remove an aggregate if it is groupOnly and its child already guarantees distinct output. For example:
      ```sql
      set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
      create table t1 using parquet as select id a, id as b from range(10);
      create table t2 using parquet as select id as a, id as b from range(8);
      select t11.a, t11.b from (select distinct a, b from t1) t11 left semi join t2 on (t11.a = t2.a) group by t11.a, t11.b;
      ```
   
      Before this PR:
      ```
      == Optimized Logical Plan ==
      Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
      +- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
         :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
         :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
         :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
         +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
            +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
               +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
      ```
   
      After this PR:
      ```
      == Optimized Logical Plan ==
      Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
      :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
      :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
      :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
      +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
         +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
            +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
      ```
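
The two changes above can be illustrated with a small, self-contained toy model. This is only a sketch: the `Plan` types, `DistinctKeys.distinctKeys`, and `DistinctKeys.removeRedundant` are hypothetical names invented for this example and are not Spark's classes or the code in this PR. It covers only a few operators and treats every `Aggregate` as group-only.

```scala
// Toy model only: hypothetical types, NOT Spark's Catalyst classes.
sealed trait Plan { def output: Seq[String] }
case class Relation(output: Seq[String]) extends Plan
case class Filter(child: Plan) extends Plan {
  def output: Seq[String] = child.output
}
// A group-only aggregate: it outputs exactly its grouping columns.
case class Aggregate(groupBy: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = groupBy
}
case class LeftSemiJoin(left: Plan, right: Plan) extends Plan {
  def output: Seq[String] = left.output
}

object DistinctKeys {
  // Each inner set is a group of columns whose combined values are unique per row.
  def distinctKeys(plan: Plan): Set[Set[String]] = plan match {
    case Relation(_)           => Set.empty            // nothing is known about a base table
    case Filter(child)         => distinctKeys(child)  // dropping rows keeps keys distinct
    case Aggregate(groupBy, _) => Set(groupBy.toSet)   // one output row per group
    case LeftSemiJoin(left, _) => distinctKeys(left)   // each left row is emitted at most once
  }

  // Drop an aggregate when its child already guarantees distinct rows on the
  // grouping columns and produces exactly the same output columns.
  def removeRedundant(plan: Plan): Plan = plan match {
    case Aggregate(groupBy, child) =>
      val newChild = removeRedundant(child)
      val alreadyDistinct = distinctKeys(newChild).exists(_.subsetOf(groupBy.toSet))
      if (alreadyDistinct && newChild.output == groupBy) newChild
      else Aggregate(groupBy, newChild)
    case Filter(child)             => Filter(removeRedundant(child))
    case LeftSemiJoin(left, right) => LeftSemiJoin(removeRedundant(left), removeRedundant(right))
    case r: Relation               => r
  }
}

object Example {
  // Mirrors the shape of the SQL example: a distinct subquery, a left semi join,
  // and an outer GROUP BY on the same columns.
  val t11: Plan = Aggregate(Seq("a", "b"), Filter(Relation(Seq("a", "b"))))
  val t2: Plan = Filter(Relation(Seq("a", "b")))
  val query: Plan = Aggregate(Seq("a", "b"), LeftSemiJoin(t11, t2))
  val optimized: Plan = DistinctKeys.removeRedundant(query)
}
```

In this toy model, `Example.optimized` is the left semi join itself: the outer aggregate is dropped because the join inherits the distinct key `{a, b}` from its left child, while the inner aggregate is kept because nothing below it guarantees distinct rows.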
   
   ### Why are the changes needed?
   
   Improve query performance by removing a redundant group-only aggregate when its child already guarantees distinct output.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit tests and a TPC-DS benchmark.
   
   SQL | Before this PR (seconds) | After this PR (seconds)
   -- | -- | --
   q14a | 206 | 193
   q38 | 59 | 41
   q87 | 127 | 113
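
To put the timings in the table above into relative terms, the following minimal sketch computes the percentage reduction in runtime per query. The `BenchmarkSummary` object and its member names are hypothetical helpers for this illustration; the numbers are copied from the table.

```scala
object BenchmarkSummary {
  // (query, seconds before this PR, seconds after this PR), copied from the table.
  val timings: Seq[(String, Double, Double)] = Seq(
    ("q14a", 206.0, 193.0),
    ("q38", 59.0, 41.0),
    ("q87", 127.0, 113.0)
  )

  // Relative reduction in runtime, as a percentage of the original runtime.
  def reductionPercent(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  // One formatted summary line per query.
  def report: Seq[String] = timings.map { case (query, before, after) =>
    f"$query: ${reductionPercent(before, after)}%.1f%% faster"
  }
}
```

Under this measure the reductions are about 6.3% for q14a, 30.5% for q38, and 11.0% for q87.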
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


