Hi all:
From google , I know that:
````
Spark can only pick BroadcastNestedLoopJoin to implement left/right join.
````
but why I use following case , broascastnestedLoopJoin became Sortmerged
join when set spark.sql.autoBroadcastJoinThreshold=-1;
{code}
set spark.sql.autoBroadcastJoinThreshold=-1;
explain select a.key1, b.key3 from testdata1 as a left join testdata3 as b
where a.value1=b.value3;
== Physical Plan ==
*(5) Project [key1#18, key3#20]
+- *(5) SortMergeJoin [value1#19], [value3#21], Inner
:- *(2) Sort [value1#19 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(value1#19, 200)
: +- *(1) Filter isnotnull(value1#19)
: +- HiveTableScan [key1#18, value1#19], HiveTableRelation
`default`.`testdata1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
[key1#18, value1#19]
+- *(4) Sort [value3#21 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value3#21, 200)
+- *(3) Filter isnotnull(value3#21)
+- HiveTableScan [key3#20, value3#21], HiveTableRelation
`default`.`testdata3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
[key3#20, value3#21]
{code}
Appreciate to know your suggestion
Best Regards
Kelly Zhang
在 2019-10-23 22:12:35,"angers.zhu" <[email protected]> 写道:
where not in ( query block)
condition will been change to LeftSemi join in optimizer rule
RewritePredicateSubquery.
Then as cloud-fan said, it will be change to a BroadCastNestLoopJoin
| |
angers.zhu
|
|
[email protected]
|
签名由网易邮箱大师定制
On 10/23/2019 20:55,Wenchen Fan<[email protected]> wrote:
I haven't looked into your query yet, just want to let you know that: Spark can
only pick BroadcastNestedLoopJoin to implement left/right join. If the table is
very big, then OOM happens.
Maybe there is an algorithm to implement left/right join in a distributed
environment without broadcast, but currently Spark is only able to deal with it
using broadcast.
On Wed, Oct 23, 2019 at 6:02 PM zhangliyun <[email protected]> wrote:
Hi all:
i want to ask a question about broadcast nestloop join? from google i know,
that
left outer/semi join and right outer/semi join will use broadcast nestloop.
and in some cases, when the input data is very small, it is suitable to use.
so here
how to define the input data very small? what parameter decides the
threshold? I just want to disable it ( i found that set
spark.sql.autoBroadcastJoinThreshold= -1 is no work for sql:select a.key1 from
testdata1 as a where a.key1 not in (select key3 from testdata3) )
```
explain cost select a.key1 from testdata1 as a where a.key1 not in (select
key3 from testdata3);
== Physical Plan ==
*(1) Project [key1#90]
+- BroadcastNestedLoopJoin BuildRight, LeftAnti, ((key1#90 = key3#92) ||
isnull((key1#90 = key3#92)))
:- HiveTableScan [key1#90], HiveTableRelation `default`.`testdata1`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key1#90, value1#91]
+- BroadcastExchange IdentityBroadcastMode
+- HiveTableScan [key3#92], HiveTableRelation `default`.`testdata3`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key3#92, value3#93]
```
my question is
1. why in not in subquery , BroadcastNestedLoopJoin is still used even i set
spark.sql.autoBroadcastJoinThreshold= -1
2. which spark parameter decides enable/disable BroadcastNestedLoopJoin.
Appreciate if you have suggestion
Best Regards
Kelly Zhang