Can you check the value of spark.sql.autoBroadcastJoinThreshold?

On 29 March 2018 at 14:41, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> wrote:
> I am looking at the physical plan for the following query:
>
>     SELECT f1, f2, f3, ...
>     FROM T1
>     LEFT ANTI JOIN T2 ON T1.id = T2.id
>     WHERE f1 = 'bla'
>       AND f2 = 'bla2'
>       AND some_date >= date_sub(current_date(), 1)
>     LIMIT 100
>
> An important detail: table T1 can be very large (hundreds of thousands of
> rows), but table T2 is rather small, a few thousand rows at most. In this
> particular case, T2 has 2 rows.
>
> In the physical plan I see that a SortMergeJoin is performed, despite this
> being a perfect candidate for a broadcast join.
>
> What could be the reason for this? Is there a way to hint the optimizer to
> perform a broadcast join in the SQL syntax?
>
> I am writing this in PySpark, and the query itself is over Parquet files
> stored in Azure blob storage.