[ 
https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12602:
----------------------------
    Description: 
If applicable, we can push Inner Join through Outer Join. This can reduce the 
number of rows since the Inner Join always can generate less rows than Outer 
Join. 

This PR is to push `Inner Join` through `Left/Right Outer Join`. The reordering 
can reduce the number of processed rows since the `Inner Join` always can 
generate less rows than `Left/Right Outer Join`. 

This PR can improve the query performance, if applicable.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan should be like
{code}
Join RightOuter, Some((int#3 = int#9))
:- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
+- Join Inner, Some((int#15 = int#9))
   :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
   +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}

  was:
If applicable, we can push Inner Join through Outer Join. This can reduce the 
number of rows since the Inner Join always can generate less rows than Outer 
Join. 

This can improve the performance.


> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. This can reduce the 
> number of rows since the Inner Join always can generate less rows than Outer 
> Join. 
> This PR is to push `Inner Join` through `Left/Right Outer Join`. The 
> reordering can reduce the number of processed rows since the `Inner Join` 
> always can generate less rows than `Left/Right Outer Join`. 
> This PR can improve the query performance, if applicable.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
> $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan should be like
> {code}
> Join RightOuter, Some((int#3 = int#9))
> :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> +- Join Inner, Some((int#15 = int#9))
>    :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
>    +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to