wangyum opened a new pull request #33361:
URL: https://github.com/apache/spark/pull/33361


   ### What changes were proposed in this pull request?
   
   1. Convert a left semi join to an inner join if uniqueness can be guaranteed on the right side.
   2. Remove an outer join if the query only references the streamed side and uniqueness can be guaranteed on the buffered side.
   
   For example:
   ```sql
   create table t1(id int, name string) using parquet;
   create table t2(id int, name string) using parquet;
   
   select * from t1 left semi join (select distinct id from t2) t21 on t1.id = t21.id;
   select t1.name from t1 left join (select distinct id from t2) t21 on t1.id = t21.id;
   ```
   
   Before this PR:
   ```
   == Optimized Logical Plan ==
   Join LeftSemi, (id#0 = id#2)
   :- Filter isnotnull(id#0)
   :  +- Relation default.t1[id#0,name#1] parquet
   +- Aggregate [id#2], [id#2]
      +- Project [id#2]
         +- Filter isnotnull(id#2)
            +- Relation default.t2[id#2,name#3] parquet
   
   == Optimized Logical Plan ==
   Project [name#1]
   +- Join LeftOuter, (id#0 = id#2)
      :- Relation default.t1[id#0,name#1] parquet
      +- Aggregate [id#2], [id#2]
         +- Project [id#2]
            +- Filter isnotnull(id#2)
               +- Relation default.t2[id#2,name#3] parquet
   ```
   
   After this PR:
   ```
   == Optimized Logical Plan ==
   Join Inner, (id#0 = id#2)
   :- Filter isnotnull(id#0)
   :  +- Relation default.t1[id#0,name#1] parquet
   +- Aggregate [id#2], [id#2]
      +- Project [id#2]
         +- Filter isnotnull(id#2)
            +- Relation default.t2[id#2,name#3] parquet
   
   == Optimized Logical Plan ==
   Project [name#1]
   +- Relation default.t1[id#0,name#1] parquet
   ```
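
The reasoning behind both rewrites can be checked on toy data. The following is a minimal sketch in plain Python, not Spark code: the helper names (`distinct_ids`, `left_semi_join`, and so on) and the sample rows are hypothetical, and the inner join is projected to the left side's columns so that it can be compared with the semi join directly. It assumes SQL equality semantics, where a NULL (`None`) never matches.

```python
# Toy model of the two rewrites; rows are plain (id, name) tuples, not Spark rows.

def distinct_ids(rows):
    """SELECT DISTINCT id: one output value per distinct id (NULL included)."""
    seen, out = set(), []
    for id_, _name in rows:
        if id_ not in seen:
            seen.add(id_)
            out.append(id_)
    return out

def matches(left_id, right_id):
    """SQL equality: a NULL (None) on either side never matches."""
    return left_id is not None and right_id is not None and left_id == right_id

def left_semi_join(left, right_ids):
    """Keep each left row at most once, if any right id matches."""
    return [row for row in left if any(matches(row[0], r) for r in right_ids)]

def inner_join_left_columns(left, right_ids):
    """Emit one row per matching pair, keeping only the left columns."""
    return [row for row in left for r in right_ids if matches(row[0], r)]

def left_outer_join_names(left, right_ids):
    """Project t1.name from a left outer join; unmatched left rows appear once."""
    out = []
    for row in left:
        hits = [r for r in right_ids if matches(row[0], r)]
        out.extend([row[1]] * max(len(hits), 1))
    return out

t1 = [(1, "a"), (2, "b"), (2, "c"), (None, "d"), (4, "e")]
t2 = [(2, "x"), (2, "y"), (3, "z"), (None, "w")]
t1_names = [name for _id, name in t1]

# With a unique right side, each left row matches at most one right row.
unique_ids = distinct_ids(t2)
semi_equals_inner = (
    left_semi_join(t1, unique_ids) == inner_join_left_columns(t1, unique_ids)
)
outer_equals_scan = left_outer_join_names(t1, unique_ids) == t1_names

# Without uniqueness, duplicate right rows multiply the matching left rows.
raw_ids = [id_ for id_, _name in t2]
semi_equals_inner_without_distinct = (
    left_semi_join(t1, raw_ids) == inner_join_left_columns(t1, raw_ids)
)
outer_equals_scan_without_distinct = (
    left_outer_join_names(t1, raw_ids) == t1_names
)

print(semi_equals_inner, outer_equals_scan)
print(semi_equals_inner_without_distinct, outer_equals_scan_without_distinct)
```

In this model the two equivalences hold only when the right side is deduplicated on the join key, which is why both rewrites depend on uniqueness being guaranteed.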
   
   ### Why are the changes needed?
   
   1. Improve query performance.
   2. PostgreSQL supports these optimizations.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.
   




