[GitHub] [spark] ulysses-you commented on pull request #34069: [SPARK-36823][SQL] Support broadcast nested loop join hint for equi-join

GitBox Sat, 25 Sep 2021 19:41:45 -0700


ulysses-you commented on pull request #34069:
URL: https://github.com/apache/spark/pull/34069#issuecomment-927217863



   Hi @c21, I agree. In general, broadcast nested loop join (BNLJ) is much slower
than sort merge join (SMJ). However, I found an extreme case: a left join with a
very small left side and a large right side, where the right side is also skewed.
In that case SMJ does not perform well, and even fails with OOM on the skewed partition.
   
   Here is a simple benchmark from my local environment:
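   ```scala
   // t1: 10M rows spread over 100 partitions; c1 is always 0, so the join key is fully skewed
   spark.range(0, 10000000).selectExpr("id % 1 as c1", "id as c2").repartition(100).createOrReplaceTempView("t1")
   // t2: 10 rows (the small left side)
   spark.range(0, 10).selectExpr("id as c1").createOrReplaceTempView("t2")
   
   // 5s with the sort merge join hint
   spark.sql("select /*+ merge(t2) */ count(*) from t2 left join t1 on t1.c1 = t2.c1").collect
   
   // 3s with the broadcast nested loop join hint proposed in this PR
   spark.sql("select /*+ broadcast_nl(t2) */ count(*) from t2 left join t1 on t1.c1 = t2.c1").collect
   ```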
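
To illustrate why the skewed key hurts SMJ here, the following is a minimal plain-Scala sketch of shuffle partitioning by join key. It is a simplified model, not Spark's implementation: `partitionOf` is a hypothetical helper, `hashCode` stands in for Spark's Murmur3-based hash, and the row count is reduced to 1,000,000 to keep it light.

```scala
// Simplified model: partition = non-negative (hash(key) mod numPartitions).
// Assumption: hashCode is a stand-in for the hash Spark actually uses.
def partitionOf(key: Long, numPartitions: Int): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h
}

val numPartitions = 100

// Same key expression as t1 in the benchmark: id % 1 is always 0.
val skewedKeys = (0L until 1000000L).map(_ % 1)

// Count rows per shuffle partition when partitioning by the join key.
val rowsPerPartition: Map[Int, Int] =
  skewedKeys.groupBy(partitionOf(_, numPartitions)).map { case (p, ks) => p -> ks.size }

// All rows share one key, so they all land in a single partition.
println(rowsPerPartition)
```

Because SMJ shuffles both sides by the join key, every t1 row in the benchmark ends up in one task. With BNLJ the small side is broadcast instead, so t1 keeps the even 100-way split produced by `repartition(100)`.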


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

