[ 
https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880747#comment-15880747
 ] 

Nattavut Sutyanyong commented on SPARK-19712:
---------------------------------------------

This is because Optimizer treats EXISTS as a predicate of a {{Filter}} that can 
be pushed down through a join and hence it is moved below the join between T1 
and T2. Later on the rule {{RewritePredicateSubquery}} converts the EXISTS to a 
Left Semi join. The Left Semi join form is considered a join. Spark does not do 
extensive join reordering to move the Left Semi join below the (inner) join 
between T1 and T2.

The question here is "should we move the Left Semi join down?" The answer is it 
depends on how much it can filter. The inner join to T2 may have a better 
filtering effect and therefore joining T2 to T1 before the left semi join to T3 
is a better. The decision to do or not to do the join reordering should be done 
with the analysis of the statistics of the join columns between the 3 tables.

>From the viewpoint of a rule-based optimization (aka query rewrite), these two 
>queries should generate the same optimal plan and leave the task of join 
>reorder to the cost-based component.

> EXISTS and Left Semi join do not produce the same plan
> ------------------------------------------------------
>
>                 Key: SPARK-19712
>                 URL: https://issues.apache.org/jira/browse/SPARK-19712
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nattavut Sutyanyong
>
> This problem was found during the development of SPARK-18874.
> The EXISTS form in the following query:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 
> from t3 where t1.t1b=t3.t3b)")}}
> gives the optimized plan below:
> {code}
> == Optimized Logical Plan ==
> Join Inner, (t1a#7 = t2a#25)
> :- Join LeftSemi, (t1b#8 = t3b#58)
> :  :- Filter isnotnull(t1a#7)
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Project [1 AS 1#271, t3b#58]
> :     +- Relation[t3a#57,t3b#58,t3c#59] parquet
> +- Filter isnotnull(t2a#25)
>    +- Relation[t2a#25,t2b#26,t2c#27] parquet
> {code}
> whereas a semantically equivalent Left Semi join query below:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on 
> t1.t1b=t3.t3b")}}
> gives the following optimized plan:
> {code}
> == Optimized Logical Plan ==
> Join LeftSemi, (t1b#8 = t3b#58)
> :- Join Inner, (t1a#7 = t2a#25)
> :  :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7))
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Filter isnotnull(t2a#25)
> :     +- Relation[t2a#25,t2b#26,t2c#27] parquet
> +- Project [t3b#58]
>    +- Relation[t3a#57,t3b#58,t3c#59] parquet
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


Reply via email to