GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/22311

    [SPARK-25305][SQL] Respect attribute name in CollapseProject and 
ColumnPruning

    ## What changes were proposed in this pull request?
    
    Currently, the optimizer rule `CollapseProject` collapses the lower-level project into the upper-level one, but the alias name from the lower level is propagated to the upper level.
    In `ColumnPruning`, a `Project` is eliminated if its output attributes are `semanticEquals` to its child's output attributes, even if the names don't match.
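    
    As a rough illustration of the `ColumnPruning` behavior described above, the sketch below uses a toy model rather than Spark's Catalyst classes. `Attr`, `ProjectElimination`, `isRedundant`, and `isRedundantRespectingNames` are hypothetical names; the only assumption carried over is that semantic equality compares attribute identity and ignores the name.
    
    ```scala
    // Toy model (assumption): NOT Spark's Catalyst classes. An attribute is
    // identified by an expression ID, and its name is only a label.
    final case class Attr(name: String, exprId: Long) {
      // Mirrors the idea of semantic equality: compare identity, ignore the name.
      def semanticEquals(other: Attr): Boolean = exprId == other.exprId
    }

    object ProjectElimination {
      // A projection is treated as redundant when every projected attribute is
      // semantically equal to the child's attribute at the same position.
      def isRedundant(projectList: Seq[Attr], childOutput: Seq[Attr]): Boolean =
        projectList.length == childOutput.length &&
          projectList.zip(childOutput).forall { case (p, c) => p.semanticEquals(c) }

      // Stricter variant (hypothetical): also require the names to match exactly.
      def isRedundantRespectingNames(projectList: Seq[Attr], childOutput: Seq[Attr]): Boolean =
        isRedundant(projectList, childOutput) &&
          projectList.zip(childOutput).forall { case (p, c) => p.name == c.name }
    }
    ```
    
    With a child output of `Attr("id", 1L)` and a project list of `Attr("ID", 1L)`, `isRedundant` returns true while `isRedundantRespectingNames` returns false, so only the stricter check keeps the projection that carries the name `ID`.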
    
    Consider the following example:
    ```
            val location = "/tmp/t"
            val df = spark.range(10).toDF("id")
            df.write.format("parquet").saveAsTable("tbl")
            spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
            // tbl2 declares its column with the upper-case name `ID`
            spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
            spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
            // Print the schema that was actually written to the Parquet files
            println(spark.read.parquet(location).schema)
            spark.table("tbl2").show()
    ```
    The column name in the written Parquet schema is `id` instead of `ID`, so the last query shows nothing from `tbl2`.
    Enabling the debug messages shows that the output name is changed from `ID` to `id`, and that the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is then changed in `RemoveRedundantAliases`.
    
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)
    
    
![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)
    
    With the fix proposed in this PR, the output name `ID` is no longer changed.
    
![wechatimg3](https://user-images.githubusercontent.com/1097932/44947923-1c915e00-ae47-11e8-9a04-0d60b65dd1f1.jpeg)
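    
    To make the `CollapseProject` side of the change easier to picture, here is a minimal sketch under stated assumptions. It is a toy model, not Spark's Catalyst classes and not the code in this PR: `NamedExpr`, `Ref`, `Alias`, `CollapseSketch`, and the `keepUpperName` flag are all hypothetical. It only shows how substituting a lower-level alias into an upper-level reference can either propagate the lower name or keep the upper one.
    
    ```scala
    // Toy model (assumption): NOT Spark's Catalyst classes.
    sealed trait NamedExpr { def name: String; def exprId: Long }
    // A reference to a column produced lower in the plan.
    final case class Ref(name: String, exprId: Long) extends NamedExpr
    // A named expression computed from a child column.
    final case class Alias(child: Ref, name: String, exprId: Long) extends NamedExpr

    object CollapseSketch {
      // Merge an upper projection into a lower one: each upper reference that
      // points at a lower alias (same exprId) is replaced by that alias.
      // With keepUpperName = true the replacement is relabeled with the upper name;
      // with keepUpperName = false the lower alias name is propagated unchanged.
      def collapse(
          upper: Seq[Ref],
          lower: Seq[NamedExpr],
          keepUpperName: Boolean): Seq[NamedExpr] = {
        val aliases = lower.collect { case a: Alias => a.exprId -> a }.toMap
        upper.map { ref =>
          aliases.get(ref.exprId) match {
            case Some(a) if keepUpperName => a.copy(name = ref.name)
            case Some(a)                  => a
            case None                     => ref
          }
        }
      }
    }
    ```
    
    For a lower project list of `Alias(Ref("id", 0L), "id", 1L)` and an upper project list of `Ref("ID", 1L)`, the collapsed output is named `id` when `keepUpperName` is false and `ID` when it is true.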
    
    
    ## How was this patch tested?
    
    Unit test


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark fixEliminateProject

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22311.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22311
    
----
commit f94fdf7fd74a75c777b5b38ce970e0742d00091c
Author: Gengliang Wang <gengliang.wang@...>
Date:   2018-09-01T15:17:04Z

    Fix ColumnPruning and CollapseProject on eliminating Project

----

