wangyum opened a new pull request #35779:
URL: https://github.com/apache/spark/pull/35779
### What changes were proposed in this pull request?
1. This PR adds a new logical plan visitor named `DistinctKeyVisitor` that finds
all the sets of distinct attributes in the current logical plan. For example:
```scala
spark.sql("CREATE TABLE t(a int, b int, c int) using parquet")
spark.sql("SELECT a, b, a % 10, a AS aliased_a, max(c), sum(b) FROM t GROUP BY a, b").queryExecution.analyzed.distinctKeys
```
The output is: {a#1, b#2}, {b#2, aliased_a#0}.
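One way to read that output: every grouping expression must be represented in a key set, and each output attribute that exposes the same grouping expression (directly or through an alias) is an interchangeable choice. The sketch below reproduces that reasoning on a toy expression model. It is an illustration under stated assumptions, not the code in this PR: `Expr`, `Attr`, `Alias`, `Func`, `Lit`, and `aggregateDistinctKeys` are hypothetical names and do not depend on Spark.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark classes.
object DistinctKeysSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  final case class Func(name: String, args: Seq[Expr]) extends Expr

  // The attribute under which an output expression is visible, if any.
  def outputAttr(e: Expr): Option[Attr] = e match {
    case a: Attr        => Some(a)
    case Alias(_, name) => Some(Attr(name))
    case _              => None
  }

  // Strip one alias so the underlying expression can be compared.
  def unalias(e: Expr): Expr = e match {
    case Alias(child, _) => child
    case other           => other
  }

  /** Distinct key sets exposed by an aggregate's output (toy version). */
  def aggregateDistinctKeys(grouping: Seq[Expr], output: Seq[Expr]): Set[Set[Attr]] = {
    // For each grouping expression, the output attributes that expose it.
    val choices: Seq[Seq[Attr]] = grouping.map { g =>
      output.filter(o => unalias(o) == g).flatMap(o => outputAttr(o).toSeq)
    }
    // If a grouping expression is not exposed at all, no key set is derived.
    if (choices.exists(_.isEmpty)) Set.empty[Set[Attr]]
    else choices.foldLeft(Set(Set.empty[Attr])) { (acc, attrs) =>
      // Cartesian product: pick one exposing attribute per grouping expression.
      for (keys <- acc; attr <- attrs) yield keys + attr
    }
  }

  def main(args: Array[String]): Unit = {
    val (a, b, c) = (Attr("a"), Attr("b"), Attr("c"))
    val output = Seq(
      a, b,
      Alias(Func("%", Seq(a, Lit(10))), "a_mod_10"),
      Alias(a, "aliased_a"),
      Alias(Func("max", Seq(c)), "max_c"),
      Alias(Func("sum", Seq(b)), "sum_b"))
    // Yields the two key sets {a, b} and {aliased_a, b}.
    println(aggregateDistinctKeys(Seq(a, b), output))
  }
}
```

In this model `a % 10` contributes nothing because it is not itself a grouping expression, which is consistent with it being absent from the key sets listed above.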
2. Enhance `RemoveRedundantAggregates` to remove an aggregation if it is
`groupOnly` and its child already guarantees that the grouping keys are distinct. For example:
```sql
set spark.sql.autoBroadcastJoinThreshold=-1; -- avoid PushDownLeftSemiAntiJoin
create table t1 using parquet as select id a, id as b from range(10);
create table t2 using parquet as select id as a, id as b from range(8);
select t11.a, t11.b from (select distinct a, b from t1) t11 left semi
join t2 on (t11.a = t2.a) group by t11.a, t11.b;
```
Before this PR:
```
== Optimized Logical Plan ==
Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
+- Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
   :- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
   :  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
   :     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
   +- Project [a#8L], Statistics(sizeInBytes=984.0 B)
      +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
         +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
```
After this PR:
```
== Optimized Logical Plan ==
Join LeftSemi, (a#6L = a#8L), Statistics(sizeInBytes=1492.0 B)
:- Aggregate [a#6L, b#7L], [a#6L, b#7L], Statistics(sizeInBytes=1492.0 B)
:  +- Filter isnotnull(a#6L), Statistics(sizeInBytes=1492.0 B)
:     +- Relation default.t1[a#6L,b#7L] parquet, Statistics(sizeInBytes=1492.0 B)
+- Project [a#8L], Statistics(sizeInBytes=984.0 B)
   +- Filter isnotnull(a#8L), Statistics(sizeInBytes=1476.0 B)
      +- Relation default.t2[a#8L,b#9L] parquet, Statistics(sizeInBytes=1476.0 B)
```
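The plans above suggest the following reasoning: the inner aggregate makes `(a, b)` distinct, the left semi join only filters rows from its left side, so the outer group-only aggregate on `(a, b)` cannot merge any rows. A minimal sketch of that check on a toy plan model follows. It is not the rule as implemented in this PR: `Plan`, `Relation`, `Project`, `Aggregate`, `LeftSemiJoin`, `distinctKeys`, and `removeRedundant` are hypothetical names, columns are plain strings, and only the operators in the example are modelled.

```scala
// Toy model only: hypothetical stand-ins for logical plan nodes, not Spark classes.
object RemoveRedundantAggregateSketch {
  sealed trait Plan { def output: Seq[String] }
  final case class Relation(output: Seq[String]) extends Plan
  final case class Project(output: Seq[String], child: Plan) extends Plan
  final case class Aggregate(grouping: Seq[String], output: Seq[String], child: Plan) extends Plan
  final case class LeftSemiJoin(left: Plan, right: Plan) extends Plan {
    def output: Seq[String] = left.output // a semi join only exposes left columns
  }

  // Column sets on which the plan's rows are known to be unique.
  def distinctKeys(plan: Plan): Set[Set[String]] = plan match {
    case Aggregate(grouping, output, _)
        if grouping.nonEmpty && grouping.forall(g => output.contains(g)) =>
      Set(grouping.toSet)
    case LeftSemiJoin(left, _) => distinctKeys(left) // filtering keeps uniqueness
    case Project(output, child) => distinctKeys(child).filter(_.subsetOf(output.toSet))
    case _ => Set.empty[Set[String]]
  }

  // Drop a group-only aggregate whose child is already distinct on its keys.
  def removeRedundant(plan: Plan): Plan = plan match {
    case Aggregate(grouping, output, child) =>
      val newChild = removeRedundant(child)
      val groupOnly = output.forall(o => grouping.contains(o))
      val childIsDistinct = distinctKeys(newChild).exists(_.subsetOf(grouping.toSet))
      if (groupOnly && childIsDistinct) {
        if (output == newChild.output) newChild else Project(output, newChild)
      } else Aggregate(grouping, output, newChild)
    case LeftSemiJoin(l, r) => LeftSemiJoin(removeRedundant(l), removeRedundant(r))
    case Project(output, child) => Project(output, removeRedundant(child))
    case r: Relation => r
  }

  def main(args: Array[String]): Unit = {
    val t1 = Relation(Seq("a", "b"))
    val t2 = Relation(Seq("a", "b"))
    val inner = Aggregate(Seq("a", "b"), Seq("a", "b"), t1) // select distinct a, b
    val join = LeftSemiJoin(inner, Project(Seq("a"), t2))
    val outer = Aggregate(Seq("a", "b"), Seq("a", "b"), join) // group by a, b
    // The outer aggregate is dropped; the inner one stays because t1 has no known keys.
    println(removeRedundant(outer) == join)
  }
}
```

The subset test is the key design point in this sketch: if rows are already unique on some key set contained in the grouping columns, they are also unique on the full grouping columns, so grouping produces exactly one output row per input row.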
### Why are the changes needed?
Improve query performance by skipping a group-only aggregation when its input is already distinct on the grouping keys.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test and TPC-DS benchmark test.
SQL | Before this PR (seconds) | After this PR (seconds)
-- | -- | --
q14a | 206 | 193
q38 | 59 | 41
q87 | 127 | 113
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.