[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...

gengliangwang Mon, 03 Sep 2018 00:19:00 -0700

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/22320


    [SPARK-25313][SQL]Fix regression in FileFormatWriter output names

    ## What changes were proposed in this pull request?
    
    Consider the following example:
    ```
    val location = "/tmp/t"
    val df = spark.range(10).toDF("id")
    df.write.format("parquet").saveAsTable("tbl")
    spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
    // The path is quoted because LOCATION expects a string literal.
    spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
    // Selects upper-case ID through a view whose column is lower-case id.
    spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
    println(spark.read.parquet(location).schema)
    spark.table("tbl2").show()
    ```
    The output column name in the written schema is `id` instead of `ID`,
    so the last query shows nothing from `tbl2`.
    With debug logging enabled, we can see that the output name changes from
    `ID` to `id`, and that the `outputColumns` in
    `InsertIntoHadoopFsRelationCommand` are then changed by
    `RemoveRedundantAliases`.
    
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)
    
    
![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)
    
    **To guarantee correctness**, we should change the output columns from
    `Seq[Attribute]` to `Seq[String]`, so that the optimizer cannot replace
    their names.
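    
    The idea can be illustrated with a toy model. This is only a sketch and
    does not use Spark's classes: `Attr`, `rewrite`, `captureNames` and
    `OutputNamesSketch` are hypothetical names made up for the illustration,
    and `rewrite` is merely a stand-in for an optimizer rule that swaps one
    attribute for another. It shows why names held as plain strings keep the
    requested spelling while names read from attributes do not.
    
    ```scala
    // Toy model only: these are NOT Spark's classes.
    final case class Attr(name: String, exprId: Long)

    object OutputNamesSketch {
      // Hypothetical stand-in for a rule that removes an alias: every attribute
      // whose exprId is a key in `replacements` is swapped for the mapped one.
      def rewrite(attrs: Seq[Attr], replacements: Map[Long, Attr]): Seq[Attr] =
        attrs.map(a => replacements.getOrElse(a.exprId, a))

      // Names captured up front as plain strings; a later rewrite of the
      // attributes cannot reach them.
      def captureNames(attrs: Seq[Attr]): Seq[String] = attrs.map(_.name)

      def main(args: Array[String]): Unit = {
        val requested = Seq(Attr("ID", 2L))          // spelling the user asked for
        val replacements = Map(2L -> Attr("id", 1L)) // alias replaced by the child column

        val names = captureNames(requested)
        val rewritten = rewrite(requested, replacements)

        println(rewritten.map(_.name).mkString(",")) // prints "id"
        println(names.mkString(","))                 // prints "ID"
      }
    }
    ```
    
    In this model the attribute sequence loses the upper-case spelling as soon
    as the replacement is applied, whereas the captured strings are unaffected.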
    
    I will fix project elimination related rules in 
https://github.com/apache/spark/pull/22311 after this one.
    
    ## How was this patch tested?
    
    Unit test.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark fixOutputSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22320.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22320
    
----
commit bbd572c1fe542c6b2fd642212f927ba384c882e4
Author: Gengliang Wang <gengliang.wang@...>
Date:   2018-08-31T16:07:00Z

    Fix regression in FileFormatWriter output schema

----


