Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
I went through the sort-merge algorithm and found that it compares the two values, resets the respective pointer to the last matched position, and then continues comparing and fetching the records. Could you please go through

Re: Spark Explain Plan and Joins

2022-02-23 Thread Mich Talebzadeh
Yes, correct, because sort-merge can only work for equijoins. The point is that the join columns are sortable in each DF. In a sort-merge join, the optimizer sorts the first DF by its join columns, sorts the second DF by its join columns, and then merges the intermediate result sets together. As
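The sort, sort, merge steps described above can be sketched in plain Python. This is only an illustration of the general algorithm under simple assumptions (in-memory lists of tuples, an inner equijoin), not Spark's actual implementation; the function name `sort_merge_join` and the DEPT/EMP sample rows are hypothetical. It also shows the pointer reset mentioned earlier in the thread: when several rows share a key, the right-hand pointer goes back to the start of the matching run for each left-hand row.

```python
from typing import Any, Callable, List, Tuple

Row = Tuple[Any, ...]


def sort_merge_join(
    left: List[Row],
    right: List[Row],
    left_key: Callable[[Row], Any],
    right_key: Callable[[Row], Any],
) -> List[Row]:
    """Inner equijoin of two row lists using sort-merge (illustrative only)."""
    # Step 1 and 2: sort each side by its join column.
    left_sorted = sorted(left, key=left_key)
    right_sorted = sorted(right, key=right_key)

    result: List[Row] = []
    i = j = 0
    # Step 3: merge by advancing whichever side has the smaller key.
    while i < len(left_sorted) and j < len(right_sorted):
        lk = left_key(left_sorted[i])
        rk = right_key(right_sorted[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Keys match: remember where the matching run on the right starts.
            run_start = j
            while i < len(left_sorted) and left_key(left_sorted[i]) == lk:
                j = run_start  # reset the right pointer for each left row
                while j < len(right_sorted) and right_key(right_sorted[j]) == lk:
                    result.append(left_sorted[i] + right_sorted[j])
                    j += 1
                i += 1
    return result


# Hypothetical sample data: (deptno, dname) and (ename, deptno).
dept = [(30, "SALES"), (10, "ACCOUNTING"), (40, "OPERATIONS"), (20, "RESEARCH")]
emp = [("SMITH", 20), ("ALLEN", 30), ("CLARK", 10), ("JONES", 20)]

for row in sort_merge_join(dept, emp, lambda d: d[0], lambda e: e[1]):
    print(row)
```

Because both inputs are sorted, each pointer only moves forward except for the reset inside a run of equal keys, so non-matching rows such as deptno 40 are skipped after a single comparison.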

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
From what I understood, you are asking whether sort-merge can be used in either of the conditions? If my understanding is correct then yes because it supports equi joins. Please correct me if I'm wrong.
On Thu, Feb 24, 2022 at 1:49 AM Mich Talebzadeh wrote:
> OK let me put this question to you

Re: Spark Explain Plan and Joins

2022-02-23 Thread Mich Talebzadeh
OK, let me put this question to you, if I may. What is the essence of sort-merge, assuming we have a SARG WHERE D.deptno = E.deptno? Can we have a sort-merge for WHERE D.deptno >= E.deptno?
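To make the contrast between the two predicates concrete, here is a small plain-Python sketch of a nested-loop join, which evaluates an arbitrary condition for every pair of rows and therefore handles both `=` and `>=`. It is an illustration only, not how Spark plans these queries; the function name `nested_loop_join` and the sample rows are hypothetical.

```python
from typing import Any, Callable, List, Tuple

Row = Tuple[Any, ...]


def nested_loop_join(
    left: List[Row],
    right: List[Row],
    predicate: Callable[[Row, Row], bool],
) -> List[Row]:
    """Inner join that tests the predicate on every (left, right) pair."""
    return [l + r for l in left for r in right if predicate(l, r)]


# Hypothetical sample data: D is (deptno, dname), E is (ename, deptno).
D = [(10, "ACCOUNTING"), (20, "RESEARCH")]
E = [("CLARK", 10), ("SMITH", 20)]

# WHERE D.deptno = E.deptno
equi_rows = nested_loop_join(D, E, lambda d, e: d[0] == e[1])

# WHERE D.deptno >= E.deptno
range_rows = nested_loop_join(D, E, lambda d, e: d[0] >= e[1])

print(len(equi_rows), len(range_rows))
```

The price of this generality is that every pair of rows is compared, which is why an equality condition that allows sorting or hashing on the join key is usually cheaper.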

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
Hi Mich, Thanks for the link. I will go through it. I have two questions regarding the sort-merge join. 1) I came across an article which mentioned that it is a better join technique because, with the keys sorted, it doesn't have to scan the entire tables. If I have keys like 1,2,4,10 and other

Re: Spark Explain Plan and Joins

2022-02-23 Thread Mich Talebzadeh
Hi Sid, For now, with regard to point 2:
> 2) Predicate push down under the optimized logical plan. Could you please help me to understand the predicate pushdown with some other simple example?
Please see this good explanation with examples: Using Spark predicate push down in Spark SQL queries
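As one more simple example of the idea, here is a toy model in plain Python. It is not Spark code: `ToySource`, `query_without_pushdown`, and `query_with_pushdown` are hypothetical names, and the "source" is just a list in memory. The point it tries to show is that pushing the filter down to the data source returns the same result while handing fewer rows to the query engine.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


class ToySource:
    """Stand-in for a data source that can optionally apply a filter itself."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def scan(self, pushed_filter: Optional[Predicate] = None) -> List[Row]:
        # With no pushed filter, every row is handed to the engine.
        if pushed_filter is None:
            return list(self._rows)
        # With a pushed filter, only matching rows leave the source.
        return [row for row in self._rows if pushed_filter(row)]


def query_without_pushdown(source: ToySource, predicate: Predicate) -> Tuple[List[Row], int]:
    transferred = source.scan()  # engine receives all rows
    result = [row for row in transferred if predicate(row)]  # then filters
    return result, len(transferred)


def query_with_pushdown(source: ToySource, predicate: Predicate) -> Tuple[List[Row], int]:
    transferred = source.scan(pushed_filter=predicate)  # source filters first
    return transferred, len(transferred)


# Hypothetical sample rows.
emp_source = ToySource([
    {"ename": "CLARK", "deptno": 10},
    {"ename": "SMITH", "deptno": 20},
    {"ename": "JONES", "deptno": 20},
    {"ename": "ALLEN", "deptno": 30},
])


def in_dept_20(row: Row) -> bool:
    return row["deptno"] == 20


rows_a, moved_a = query_without_pushdown(emp_source, in_dept_20)
rows_b, moved_b = query_with_pushdown(emp_source, in_dept_20)
print(rows_a == rows_b, moved_a, moved_b)  # prints: True 4 2
```

In a real Spark plan the equivalent effect is visible when the filter appears at or inside the scan node rather than as a separate step after the data has been read.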

Re: Spark Explain Plan and Joins

2022-02-23 Thread Sid
Hi, Can you help me with my doubts? Any links would also be helpful. Thanks, Sid
On Wed, Feb 23, 2022 at 1:22 AM Sid Kal wrote:
> Hi Mich / Gourav,
>
> Thanks for your time :) Much appreciated. I went through the article
> shared by Mich about the query execution plan. I pretty much

Re: Spark Explain Plan and Joins

2022-02-22 Thread Sid Kal
Hi Mich / Gourav, Thanks for your time :) Much appreciated. I went through the article shared by Mich about the query execution plan. I have understood most of it so far, except for the two things below. 1) HashAggregate in the plan: does this always indicate "group by" columns? 2)

Re: Spark Explain Plan and Joins

2022-02-21 Thread Gourav Sengupta
Hi, I think that the best option is to use the SPARK UI. In SPARK 3.x the UI and its additional settings are fantastic. Try to also look at the settings for Adaptive Query Execution in SPARK; under certain conditions it really works wonders. For certain long queries, the way you are finally

Re: Spark Explain Plan and Joins

2022-02-20 Thread Gourav Sengupta
Hi, what are you trying to achieve by this? If there is a performance deterioration, try to collect the query execution run-time statistics from SPARK SQL. They can be seen in the SPARK SQL UI and are available over APIs, if I am not wrong. Please ensure that you are not trying to over

Re: Spark Explain Plan and Joins

2022-02-20 Thread Mich Talebzadeh
Hi Sid, This article is concise and pretty up-to-date. Spark’s Logical and Physical plans … When, Why, How and Beyond. It is a good start. If after reading it, some stuff needs to be explained,

Re: Spark Explain Plan and Joins

2022-02-20 Thread Sid
Thank you so much for your reply, Mich. I will go through it. However, I want to understand how to read this plan. If I face any errors, or if I want to see how Spark is optimizing for cost, how should I approach it? Could you help me in layman's terms? Thanks, Sid On Sun, 20 Feb 2022, 17:50 Mich

Re: Spark Explain Plan and Joins

2022-02-20 Thread Mich Talebzadeh
Do a Google search on *sort-merge spark*. There are plenty of notes on the topic and examples. NLJ, sort-merge, and hash joins and their derivatives are common join algorithms in database systems. These were not created by Spark. At a given time, there are reasons why one specific join is preferred over