wangyum opened a new pull request #35216:
URL: https://github.com/apache/spark/pull/35216


   ### What changes were proposed in this pull request?
   
   It is safe to push LIMIT 1 down to the right side of a LEFT SEMI/LEFT ANTI 
join when the join condition is empty, since the only thing that matters is 
whether the right side produces any rows at all. For example:
   ```scala
   val numRows = 1024 * 1024 * 40
   
   spark.sql(s"CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(1, ${numRows}L, 1, 5)")
   spark.sql(s"CREATE TABLE t2 using parquet AS SELECT id AS a, id AS b, id AS c FROM range(1, ${numRows}L, 1, 5)")
   
   spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2 LIMIT 5").explain(true)
   ```
   
   Before this PR:
   ```
   == Optimized Logical Plan ==
   GlobalLimit 5
   +- LocalLimit 5
      +- Join LeftSemi
         :- LocalLimit 5
         :  +- Relation default.t1[a#8L,b#9L,c#10L] parquet
         +- Project
            +- Relation default.t2[a#11L,b#12L,c#13L] parquet
   ```
   
   After this PR:
   ```
   == Optimized Logical Plan ==
   GlobalLimit 5
   +- LocalLimit 5
      +- Join LeftSemi
         :- LocalLimit 5
         :  +- Relation default.t1[a#8L,b#9L,c#10L] parquet
         +- LocalLimit 1
            +- Project
               +- Relation default.t2[a#11L,b#12L,c#13L] parquet
   ```
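   The rewrite can be sketched on a toy plan ADT. This is a minimal illustration, not Spark's actual Catalyst classes: `Plan`, `Relation`, `LocalLimit`, and `Join` here are simplified stand-ins, and `pushLimitToRight` is a hypothetical helper name.
   ```scala
   // Toy logical-plan ADT (stand-ins for Catalyst's LogicalPlan nodes).
   sealed trait Plan
   case class Relation(name: String) extends Plan
   case class LocalLimit(n: Int, child: Plan) extends Plan
   case class Join(joinType: String, left: Plan, right: Plan,
                   condition: Option[String]) extends Plan

   // For a LEFT SEMI or LEFT ANTI join with no condition, only whether the
   // right side is empty matters, so one row from it is enough.
   def pushLimitToRight(plan: Plan): Plan = plan match {
     case j @ Join("LeftSemi" | "LeftAnti", _, right, None) =>
       j.copy(right = LocalLimit(1, right))
     case other => other
   }

   val before = Join("LeftSemi", Relation("t1"), Relation("t2"), None)
   val after  = pushLimitToRight(before)
   // after == Join("LeftSemi", Relation("t1"), LocalLimit(1, Relation("t2")), None)
   ```
   A join with a non-empty condition (or any other join type) is left untouched, since in those cases the right side's rows participate in matching and cannot be truncated.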
   
   ### Why are the changes needed?
   
   Improve query performance.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


