GitHub user stanzhai opened a pull request:

    https://github.com/apache/spark/pull/17099

    Constant alias columns in INNER JOIN should not be folded by 
FoldablePropagation rule

    ## What changes were proposed in this pull request?
    This PR fixes the code in Optimizer phase where the constant alias columns 
of a `INNER JOIN` query are folded in Rule `FoldablePropagation`.
    
    For the following query():
    
    ```
    val sqlA =
      """
        |create temporary view ta as
        |select a, 'a' as tag from t1 union all
        |select a, 'b' as tag from t2
      """.stripMargin
    
    val sqlB =
      """
        |create temporary view tb as
        |select a, 'a' as tag from t3 union all
        |select a, 'b' as tag from t4
      """.stripMargin
    
    val sql =
      """
        |select tb.* from ta inner join tb on
        |ta.a = tb.a and
        |ta.tag = tb.tag
      """.stripMargin
    ```
    
    The tag column is an constant alias column, it's folded by 
`FoldablePropagation` like this:
    
    ```
    TRACE SparkOptimizer: 
    === Applying Rule 
org.apache.spark.sql.catalyst.optimizer.FoldablePropagation ===
     Project [a#4, tag#14]                              Project [a#4, tag#14]
    !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, ((a#0 = 
a#4) && (a = a))
        :- Union                                           :- Union
        :  :- Project [a#0, a AS tag#8]                    :  :- Project [a#0, 
a AS tag#8]
        :  :  +- LocalRelation [a#0]                       :  :  +- 
LocalRelation [a#0]
        :  +- Project [a#2, b AS tag#9]                    :  +- Project [a#2, 
b AS tag#9]
        :     +- LocalRelation [a#2]                       :     +- 
LocalRelation [a#2]
        +- Union                                           +- Union
           :- Project [a#4, a AS tag#14]                      :- Project [a#4, 
a AS tag#14]
           :  +- LocalRelation [a#4]                          :  +- 
LocalRelation [a#4]
           +- Project [a#6, b AS tag#15]                      +- Project [a#6, 
b AS tag#15]
              +- LocalRelation [a#6]                             +- 
LocalRelation [a#6]
    ```
    
    Finally the Result of Batch Operator Optimizations is:
    
    ```
    Project [a#4, tag#14]                              Project [a#4, tag#14]
    !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14))   +- Join Inner, (a#0 = 
a#4)
    !   :- SubqueryAlias ta, `ta`                          :- Union
    !   :  +- Union                                        :  :- LocalRelation 
[a#0]
    !   :     :- Project [a#0, a AS tag#8]                 :  +- LocalRelation 
[a#2]
    !   :     :  +- SubqueryAlias t1, `t1`                 +- Union
    !   :     :     +- Project [a#0]                          :- LocalRelation 
[a#4, tag#14]
    !   :     :        +- SubqueryAlias grouping              +- LocalRelation 
[a#6, tag#15]
    !   :     :           +- LocalRelation [a#0]        
    !   :     +- Project [a#2, b AS tag#9]              
    !   :        +- SubqueryAlias t2, `t2`              
    !   :           +- Project [a#2]                    
    !   :              +- SubqueryAlias grouping        
    !   :                 +- LocalRelation [a#2]        
    !   +- SubqueryAlias tb, `tb`                       
    !      +- Union                                     
    !         :- Project [a#4, a AS tag#14]             
    !         :  +- SubqueryAlias t3, `t3`              
    !         :     +- Project [a#4]                    
    !         :        +- SubqueryAlias grouping        
    !         :           +- LocalRelation [a#4]        
    !         +- Project [a#6, b AS tag#15]             
    !            +- SubqueryAlias t4, `t4`              
    !               +- Project [a#6]                    
    !                  +- SubqueryAlias grouping        
    !                     +- LocalRelation [a#6]    
    ```
    
    The condition `tag#8 = tag#14` of INNER JOIN has been removed. This leads 
to the data of inner join being wrong.
    
    After fix:
    
    ```
    === Result of Batch LocalRelation ===
     GlobalLimit 21                                           GlobalLimit 21
     +- LocalLimit 21                                         +- LocalLimit 21
        +- Project [a#4, tag#11]                                 +- Project 
[a#4, tag#11]
           +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11))         +- Join 
Inner, ((a#0 = a#4) && (tag#8 = tag#11))
    !         :- SubqueryAlias ta                                      :- Union
    !         :  +- Union                                              :  :- 
LocalRelation [a#0, tag#8]
    !         :     :- Project [a#0, a AS tag#8]                       :  +- 
LocalRelation [a#2, tag#9]
    !         :     :  +- SubqueryAlias t1                             +- Union
    !         :     :     +- Project [a#0]                                :- 
LocalRelation [a#4, tag#11]
    !         :     :        +- SubqueryAlias grouping                    +- 
LocalRelation [a#6, tag#12]
    !         :     :           +- LocalRelation [a#0]        
    !         :     +- Project [a#2, b AS tag#9]              
    !         :        +- SubqueryAlias t2                    
    !         :           +- Project [a#2]                    
    !         :              +- SubqueryAlias grouping        
    !         :                 +- LocalRelation [a#2]        
    !         +- SubqueryAlias tb                             
    !            +- Union                                     
    !               :- Project [a#4, a AS tag#11]             
    !               :  +- SubqueryAlias t3                    
    !               :     +- Project [a#4]                    
    !               :        +- SubqueryAlias grouping        
    !               :           +- LocalRelation [a#4]        
    !               +- Project [a#6, b AS tag#12]             
    !                  +- SubqueryAlias t4                    
    !                     +- Project [a#6]                    
    !                        +- SubqueryAlias grouping        
    !                           +- LocalRelation [a#6]  
    ```
    
    ## How was this patch tested?
    
    add sql-tests/inputs/inner-join.sql
    All tests passed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/stanzhai/spark fix-inner-join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17099.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17099
    
----
commit 02b9dd6b8c3eb49bb8e6e537d0432c1eff35baad
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-02-28T12:44:14Z

    fix inner join

commit 112dd2379bf9febdc9ed81925326b61d2a34efdd
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-02-28T12:55:33Z

    fix inner-join.sql.out

commit 44636483bb1b87d7e4746ae98df47b3f9dc7e8ce
Author: Stan Zhai <zhaishi...@haizhi.com>
Date:   2017-02-28T13:13:23Z

    update test

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to