Re: Why doesn't spark use broadcast join?

2018-04-18 Thread Kurt Fehlhauer
Try running AnalyzeTableCommand on both tables first.
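
For example, a minimal PySpark sketch (AnalyzeTableCommand is what backs the
ANALYZE TABLE statement; it assumes T1 and T2 are registered in the catalog).
Computing statistics gives the optimizer the size estimates it needs to
consider a broadcast join:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compute table-level statistics (total size and row count) for both tables.
spark.sql("ANALYZE TABLE T1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE T2 COMPUTE STATISTICS")

# Optionally check the recorded statistics for the small table.
spark.sql("DESCRIBE EXTENDED T2").show(truncate=False)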

On Wed, Apr 18, 2018 at 2:57 AM Matteo Cossu wrote:

> Can you check the value for spark.sql.autoBroadcastJoinThreshold?
>
> On 29 March 2018 at 14:41, Vitaliy Pisarev wrote:
>
>> I am looking at the physical plan for the following query:
>>
>> SELECT f1,f2,f3,...
>> FROM T1
>> LEFT ANTI JOIN T2 ON T1.id = T2.id
>> WHERE  f1 = 'bla'
>>AND f2 = 'bla2'
>>AND some_date >= date_sub(current_date(), 1)
>> LIMIT 100
>>
>> An important detail: the table T1 can be very large (hundreds of
>> thousands of rows), but the table T2 is rather small, at most in the
>> thousands. In this particular case, T2 has only 2 rows.
>>
>> In the physical plan, I see that a SortMergeJoin is performed, despite
>> this being a perfect candidate for a broadcast join.
>>
>> What could be the reason for this?
>> Is there a way to hint the optimizer to perform a broadcast join in the
>> SQL syntax?
>>
>> I am writing this in PySpark, and the query itself runs over Parquet
>> files stored in Azure Blob Storage.
>>
>>
>>
>


Re: Why doesn't spark use broadcast join?

2018-04-18 Thread Matteo Cossu
Can you check the value for spark.sql.autoBroadcastJoinThreshold?
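
A quick sketch of how to check and adjust it from PySpark (the default is
10 MB, i.e. 10485760 bytes; setting it to -1 disables automatic broadcasting):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tables whose estimated size is below this many bytes get broadcast.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# Raise the threshold if the small table's estimate exceeds the default 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)

Note that the planner compares this threshold against its *estimated* size for
T2, so stale or missing table statistics can still prevent the broadcast.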

On 29 March 2018 at 14:41, Vitaliy Pisarev wrote:

> I am looking at the physical plan for the following query:
>
> SELECT f1,f2,f3,...
> FROM T1
> LEFT ANTI JOIN T2 ON T1.id = T2.id
> WHERE  f1 = 'bla'
>AND f2 = 'bla2'
>AND some_date >= date_sub(current_date(), 1)
> LIMIT 100
>
> An important detail: the table T1 can be very large (hundreds of
> thousands of rows), but the table T2 is rather small, at most in the
> thousands. In this particular case, T2 has only 2 rows.
>
> In the physical plan, I see that a SortMergeJoin is performed, despite
> this being a perfect candidate for a broadcast join.
>
> What could be the reason for this?
> Is there a way to hint the optimizer to perform a broadcast join in the
> SQL syntax?
>
> I am writing this in PySpark, and the query itself runs over Parquet
> files stored in Azure Blob Storage.
>
>
>


Re: Why doesn't spark use broadcast join?

2018-03-29 Thread Lalwani, Jayesh
Try putting a Broadcast hint like the one shown here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hint-framework.html#sql-hints
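
For example, a sketch of both the SQL hint and the DataFrame-side broadcast()
function (the BROADCAST hint exists since Spark 2.2; table and column names are
taken from the query quoted below):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# SQL hint form: ask the planner to broadcast T2 regardless of the threshold.
hinted = spark.sql("""
    SELECT /*+ BROADCAST(T2) */ T1.*
    FROM T1
    LEFT ANTI JOIN T2 ON T1.id = T2.id
""")

# DataFrame API equivalent.
t1 = spark.table("T1")
t2 = spark.table("T2")
hinted_df = t1.join(broadcast(t2), on="id", how="left_anti")

# Either plan should now show a broadcast join on T2.
hinted.explain()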

From: Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
Date: Thursday, March 29, 2018 at 8:42 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Why doesn't spark use broadcast join?

I am looking at the physical plan for the following query:


SELECT f1,f2,f3,...
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE  f1 = 'bla'
   AND f2 = 'bla2'
   AND some_date >= date_sub(current_date(), 1)
LIMIT 100

An important detail: the table T1 can be very large (hundreds of thousands of
rows), but the table T2 is rather small, at most in the thousands.
In this particular case, T2 has only 2 rows.

In the physical plan, I see that a SortMergeJoin is performed, despite this
being a perfect candidate for a broadcast join.

What could be the reason for this?
Is there a way to hint the optimizer to perform a broadcast join in the SQL
syntax?

I am writing this in PySpark, and the query itself runs over Parquet files
stored in Azure Blob Storage.




Why doesn't spark use broadcast join?

2018-03-29 Thread Vitaliy Pisarev
I am looking at the physical plan for the following query:

SELECT f1,f2,f3,...
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE  f1 = 'bla'
   AND f2 = 'bla2'
   AND some_date >= date_sub(current_date(), 1)
LIMIT 100

An important detail: the table T1 can be very large (hundreds of thousands of
rows), but the table T2 is rather small, at most in the thousands.
In this particular case, T2 has only 2 rows.

In the physical plan, I see that a SortMergeJoin is performed, despite this
being a perfect candidate for a broadcast join.

What could be the reason for this?
Is there a way to hint the optimizer to perform a broadcast join in the SQL
syntax?

I am writing this in PySpark, and the query itself runs over Parquet files
stored in Azure Blob Storage.
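
For reference, a minimal sketch of how the chosen strategy can be checked from
PySpark (a hypothetical reproduction using the names above; the real query has
more columns and filters):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the join, minus the elided columns and the date filter.
t1 = spark.table("T1")
t2 = spark.table("T2")
result = t1.join(t2, on="id", how="left_anti").filter("f1 = 'bla'").limit(100)

# The physical plan names the strategy: SortMergeJoin vs. BroadcastHashJoin.
result.explain()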