Re: Why doesn't spark use broadcast join?
Try running AnalyzeTableCommand on both tables first.

On Wed, Apr 18, 2018 at 2:57 AM Matteo Cossu wrote:

> Can you check the value for spark.sql.autoBroadcastJoinThreshold?
>
> On 29 March 2018 at 14:41, Vitaliy Pisarev wrote:
>
>> I am looking at the physical plan for the following query:
>>
>> SELECT f1,f2,f3,...
>> FROM T1
>> LEFT ANTI JOIN T2 ON T1.id = T2.id
>> WHERE f1 = 'bla'
>>   AND f2 = 'bla2'
>>   AND some_date >= date_sub(current_date(), 1)
>> LIMIT 100
>>
>> An important detail: the table T1 can be very large (hundreds of
>> thousands of rows), but table T2 is rather small. Maximum in the thousands.
>> In this particular case, table T2 has 2 rows.
>>
>> In the physical plan, I see that a SortMergeJoin is performed, despite it
>> being the perfect candidate for a broadcast join.
>>
>> What could be the reason for this?
>> Is there a way to hint the optimizer to perform a broadcast join in the
>> SQL syntax?
>>
>> I am writing this in pyspark and the query itself is over parquets stored
>> in Azure blob storage.
Re: Why doesn't spark use broadcast join?
Can you check the value for spark.sql.autoBroadcastJoinThreshold?

On 29 March 2018 at 14:41, Vitaliy Pisarev wrote:

> I am looking at the physical plan for the following query:
>
> SELECT f1,f2,f3,...
> FROM T1
> LEFT ANTI JOIN T2 ON T1.id = T2.id
> WHERE f1 = 'bla'
>   AND f2 = 'bla2'
>   AND some_date >= date_sub(current_date(), 1)
> LIMIT 100
>
> An important detail: the table T1 can be very large (hundreds of
> thousands of rows), but table T2 is rather small. Maximum in the thousands.
> In this particular case, table T2 has 2 rows.
>
> In the physical plan, I see that a SortMergeJoin is performed, despite it
> being the perfect candidate for a broadcast join.
>
> What could be the reason for this?
> Is there a way to hint the optimizer to perform a broadcast join in the
> SQL syntax?
>
> I am writing this in pyspark and the query itself is over parquets stored
> in Azure blob storage.
Re: Why doesn't spark use broadcast join?
Try putting a broadcast hint like shown here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hint-framework.html#sql-hints

From: Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
Date: Thursday, March 29, 2018 at 8:42 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Why doesn't spark use broadcast join?

I am looking at the physical plan for the following query:

SELECT f1,f2,f3,...
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE f1 = 'bla'
  AND f2 = 'bla2'
  AND some_date >= date_sub(current_date(), 1)
LIMIT 100

An important detail: the table T1 can be very large (hundreds of thousands of rows), but table T2 is rather small. Maximum in the thousands. In this particular case, table T2 has 2 rows.

In the physical plan, I see that a SortMergeJoin is performed, despite it being the perfect candidate for a broadcast join.

What could be the reason for this?
Is there a way to hint the optimizer to perform a broadcast join in the SQL syntax?

I am writing this in pyspark and the query itself is over parquets stored in Azure blob storage.
Why doesn't spark use broadcast join?
I am looking at the physical plan for the following query:

SELECT f1,f2,f3,...
FROM T1
LEFT ANTI JOIN T2 ON T1.id = T2.id
WHERE f1 = 'bla'
  AND f2 = 'bla2'
  AND some_date >= date_sub(current_date(), 1)
LIMIT 100

An important detail: the table T1 can be very large (hundreds of thousands of rows), but table T2 is rather small. Maximum in the thousands. In this particular case, table T2 has 2 rows.

In the physical plan, I see that a SortMergeJoin is performed, despite it being the perfect candidate for a broadcast join.

What could be the reason for this?
Is there a way to hint the optimizer to perform a broadcast join in the SQL syntax?

I am writing this in pyspark and the query itself is over parquets stored in Azure blob storage.