I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens.
Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but currently Spark is only able to deal with it using broadcast. On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <[email protected]> wrote: > Hi all: > i want to ask a question about broadcast nestloop join? from google i > know, that > left outer/semi join and right outer/semi join will use broadcast > nestloop. > and in some cases, when the input data is very small, it is suitable to > use. so here > how to define the input data very small? what parameter decides the > threshold? I just want to disable it ( i found that set > spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1 > from testdata1 as a where a.key1 not in (select key3 from testdata3) ) > > > ``` > > explain cost select a.key1 from testdata1 as a where a.key1 not in > (select key3 from testdata3); > > == Physical Plan == > *(1) Project [key1#90] > +- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) || > isnull((key1#90 = key3#92))) > :- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91] > +- BroadcastExchange IdentityBroadcastMode > +- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93] > > ``` > > my question is > 1. why in not in subquery , BroadcastNestedLoopJoin is still used even i > set spark.sql.autoBroadcastJoinThreshold= -1 > 2. which spark parameter decides enable/disable BroadcastNestedLoopJoin. > > > > Appreciate if you have suggestion > > > Best Regards > > Kelly Zhang > > > >
