[
https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin updated SPARK-12602:
--------------------------------
Component/s: Optimizer
> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
> Key: SPARK-12602
> URL: https://issues.apache.org/jira/browse/SPARK-12602
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer, SQL
> Affects Versions: 1.6.0
> Reporter: Xiao Li
> Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. The basic idea is
> built on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 =
> (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 =
> (R2 inner R3 on p23) left R1 on p12
> {code}
> The reordering can reduce the number of processed rows since the Inner Join
> always can generate less (or equivalent) rows than Left/Right Outer Join.
> This change can improve the query performance in most cases.
> When cost-based optimization is available, we can switch the order of tables
> in each join type based on their costs. The order of joined tables in the
> inner join does not affect the results and the right outer join can be
> changed to the left outer join. This part is out of scope here.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" ===
> $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> : :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> : +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan is like
> {code}
> Join LeftOuter, Some((int#3 = int#9))
> :- Join Inner, Some((int#15 = int#9))
> : :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> : +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> +- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]