[ 
https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12602:
----------------------------
    Description: 
If applicable, we can push Inner Join through Outer Join. The basic idea is 
built on the associativity property of outer and inner joins:
{code}
R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = 
(R1 inner R3 on p13) left R2 on p23
(R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
(R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = 
(R2 inner R3 on p23) left R1 on p12
{code}

The reordering can reduce the number of processed rows since the Inner Join 
always can generate less (or equivalent) rows than Left/Right Outer Join. This 
change can improve the query performance in most cases.

When cost-based optimization is available, we can switch the order of tables in 
each join type based on their costs. The order of joined tables in the inner 
join does not affect the results and the right outer join can be changed to the 
left outer join. This part is out of scope here.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan is like
{code}
Join LeftOuter, Some((int#3 = int#9))
:- Join Inner, Some((int#15 = int#9))
:  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
:  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
+- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
{code}

  was:
If applicable, we can push Inner Join through Outer Join. 

The reordering can reduce the number of processed rows since the `Inner Join` 
always can generate less rows than `Left/Right Outer Join`. Thus, it can 
improve the query performance.

For example, given the following eligible query:
{code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
$"b.int", "inner"){code}

Before the fix, the logical plan is like
{code}
Join Inner, Some((int#15 = int#9))
:- Join RightOuter, Some((int#3 = int#9))
:  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
:  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
+- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}
After the fix, the logical plan should be like
{code}
Join RightOuter, Some((int#3 = int#9))
:- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
+- Join Inner, Some((int#15 = int#9))
   :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
   +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
{code}


> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. The basic idea is 
> built on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = 
> (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = 
> (R2 inner R3 on p23) left R1 on p12
> {code}
> The reordering can reduce the number of processed rows since the Inner Join 
> always can generate less (or equivalent) rows than Left/Right Outer Join. 
> This change can improve the query performance in most cases.
> When cost-based optimization is available, we can switch the order of tables 
> in each join type based on their costs. The order of joined tables in the 
> inner join does not affect the results and the right outer join can be 
> changed to the left outer join. This part is out of scope here.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === 
> $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan is like
> {code}
> Join LeftOuter, Some((int#3 = int#9))
> :- Join Inner, Some((int#15 = int#9))
> :  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> :  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> +- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to