GitHub user stanzhai opened a pull request:
https://github.com/apache/spark/pull/17099
Constant alias columns in INNER JOIN should not be folded by
FoldablePropagation rule
## What changes were proposed in this pull request?
This PR fixes the code in Optimizer phase where the constant alias columns
of a `INNER JOIN` query are folded in Rule `FoldablePropagation`.
For the following query():
```
val sqlA =
"""
|create temporary view ta as
|select a, 'a' as tag from t1 union all
|select a, 'b' as tag from t2
""".stripMargin
val sqlB =
"""
|create temporary view tb as
|select a, 'a' as tag from t3 union all
|select a, 'b' as tag from t4
""".stripMargin
val sql =
"""
|select tb.* from ta inner join tb on
|ta.a = tb.a and
|ta.tag = tb.tag
""".stripMargin
```
The tag column is an constant alias column, it's folded by
`FoldablePropagation` like this:
```
TRACE SparkOptimizer:
=== Applying Rule
org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
Project [a#4, tag#14] Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, ((a#0 =
a#4) && (a = a))
:- Union :- Union
: :- Project [a#0, a AS tag#8] : :- Project [a#0,
a AS tag#8]
: : +- LocalRelation [a#0] : : +-
LocalRelation [a#0]
: +- Project [a#2, b AS tag#9] : +- Project [a#2,
b AS tag#9]
: +- LocalRelation [a#2] : +-
LocalRelation [a#2]
+- Union +- Union
:- Project [a#4, a AS tag#14] :- Project [a#4,
a AS tag#14]
: +- LocalRelation [a#4] : +-
LocalRelation [a#4]
+- Project [a#6, b AS tag#15] +- Project [a#6,
b AS tag#15]
+- LocalRelation [a#6] +-
LocalRelation [a#6]
```
Finally the Result of Batch Operator Optimizations is:
```
Project [a#4, tag#14] Project [a#4, tag#14]
!+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, (a#0 =
a#4)
! :- SubqueryAlias ta, `ta` :- Union
! : +- Union : :- LocalRelation
[a#0]
! : :- Project [a#0, a AS tag#8] : +- LocalRelation
[a#2]
! : : +- SubqueryAlias t1, `t1` +- Union
! : : +- Project [a#0] :- LocalRelation
[a#4, tag#14]
! : : +- SubqueryAlias grouping +- LocalRelation
[a#6, tag#15]
! : : +- LocalRelation [a#0]
! : +- Project [a#2, b AS tag#9]
! : +- SubqueryAlias t2, `t2`
! : +- Project [a#2]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#2]
! +- SubqueryAlias tb, `tb`
! +- Union
! :- Project [a#4, a AS tag#14]
! : +- SubqueryAlias t3, `t3`
! : +- Project [a#4]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#4]
! +- Project [a#6, b AS tag#15]
! +- SubqueryAlias t4, `t4`
! +- Project [a#6]
! +- SubqueryAlias grouping
! +- LocalRelation [a#6]
```
The condition `tag#8 = tag#14` of INNER JOIN has been removed. This leads
to the data of inner join being wrong.
After fix:
```
=== Result of Batch LocalRelation ===
GlobalLimit 21 GlobalLimit 21
+- LocalLimit 21 +- LocalLimit 21
+- Project [a#4, tag#11] +- Project
[a#4, tag#11]
+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) +- Join
Inner, ((a#0 = a#4) && (tag#8 = tag#11))
! :- SubqueryAlias ta :- Union
! : +- Union : :-
LocalRelation [a#0, tag#8]
! : :- Project [a#0, a AS tag#8] : +-
LocalRelation [a#2, tag#9]
! : : +- SubqueryAlias t1 +- Union
! : : +- Project [a#0] :-
LocalRelation [a#4, tag#11]
! : : +- SubqueryAlias grouping +-
LocalRelation [a#6, tag#12]
! : : +- LocalRelation [a#0]
! : +- Project [a#2, b AS tag#9]
! : +- SubqueryAlias t2
! : +- Project [a#2]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#2]
! +- SubqueryAlias tb
! +- Union
! :- Project [a#4, a AS tag#11]
! : +- SubqueryAlias t3
! : +- Project [a#4]
! : +- SubqueryAlias grouping
! : +- LocalRelation [a#4]
! +- Project [a#6, b AS tag#12]
! +- SubqueryAlias t4
! +- Project [a#6]
! +- SubqueryAlias grouping
! +- LocalRelation [a#6]
```
## How was this patch tested?
add sql-tests/inputs/inner-join.sql
All tests passed.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/stanzhai/spark fix-inner-join
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17099.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17099
----
commit 02b9dd6b8c3eb49bb8e6e537d0432c1eff35baad
Author: Stan Zhai <[email protected]>
Date: 2017-02-28T12:44:14Z
fix inner join
commit 112dd2379bf9febdc9ed81925326b61d2a34efdd
Author: Stan Zhai <[email protected]>
Date: 2017-02-28T12:55:33Z
fix inner-join.sql.out
commit 44636483bb1b87d7e4746ae98df47b3f9dc7e8ce
Author: Stan Zhai <[email protected]>
Date: 2017-02-28T13:13:23Z
update test
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]