Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to implement left/right join" this should be "left/right non-equal join"
On Thu, Oct 24, 2019 at 6:32 AM zhangliyun <kelly...@126.com> wrote: > > Hi Herman: > I guess what you mentioned before > ``` > if you are OK with slightly different NULL semantics then you could use NOT > EXISTS(subquery). The latter should perform a lot better. > > ``` > is the NULL key1 of left table will be retained if NULL key2 is not found > in the right table ( join condition : left.key1 = right.key2) in exists > semantics while this will not happen in > "in semantics". If my understanding wrong, tell me. > > > > Best Regards. > > Kelly Zhang > > > > > > > > > 在 2019-10-23 19:16:34,"Herman van Hovell" <her...@databricks.com> 写道: > > In some cases BroadcastNestedLoopJoin is the only viable join method. In > your example for instance you are using a non-equi join condition and BNLJ > is the only method that works in that case. This is also the reason why you > can't disable it using the spark.sql.autoBroadcastJoinThreshold > configuration. > > Such a plan is generally generated by using a NOT IN (subquery), if you > are OK with slightly different NULL semantics then you could use NOT > EXISTS(subquery). The latter should perform a lot better. > > On Wed, Oct 23, 2019 at 12:02 PM zhangliyun <kelly...@126.com> wrote: > >> Hi all: >> i want to ask a question about broadcast nestloop join? from google i >> know, that >> left outer/semi join and right outer/semi join will use broadcast >> nestloop. >> and in some cases, when the input data is very small, it is suitable to >> use. so here >> how to define the input data very small? what parameter decides the >> threshold? I just want to disable it ( i found that set >> spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1 >> from testdata1 as a where a.key1 not in (select key3 from testdata3) ) >> >> >> ``` >> >> explain cost select a.key1 from testdata1 as a where a.key1 not in >> (select key3 from testdata3); >> >> == Physical Plan == >> *(1) Project [key1#90] >> +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || >> isnull((key1#90 = key3#92))) >> :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, >> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91] >> +- BroadcastExchange IdentityBroadcastMode >> +- HiveTableScan [key3#92], HiveTableRelation >> `default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, >> [key3#92, value3#93] >> >> ``` >> >> my question is >> 1. why in not in subquery , BroadcastNestedLoopJoin is still used even >> i set spark.sql.autoBroadcastJoinThreshold= -1 >> 2. which spark parameter decides enable/disable >> BroadcastNestedLoopJoin. >> >> >> >> Appreciate if you have suggestion >> >> >> Best Regards >> >> Kelly Zhang >> >> >> >> > > > >