GitHub user gengliangwang opened a pull request:
https://github.com/apache/spark/pull/22320
[SPARK-25313][SQL] Fix regression in FileFormatWriter output names
## What changes were proposed in this pull request?
Consider the following example:
```
val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()
```
The column name in the written schema is `id` instead of `ID`, so the
last query shows nothing from `tbl2`.
Enabling the debug messages shows that the output name is changed
from `ID` to `id`, and that the `outputColumns` in
`InsertIntoHadoopFsRelationCommand` are changed by `RemoveRedundantAliases`.
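
To illustrate why the mismatched name makes the data invisible on the read side, here is a minimal plain-Scala sketch. It is a toy model and not Spark code: `NameMismatchSketch` and `readColumn` are hypothetical names, and the case-sensitive lookup is an assumption about how the declared column is matched against the file.

```scala
// Toy model, not Spark code: a written file exposes columns by name,
// and a reader resolves the table's declared columns against it.
object NameMismatchSketch {
  // Hypothetical helper: case-sensitive lookup of a declared column in the file.
  def readColumn(file: Map[String, Seq[Long]], declared: String): Option[Seq[Long]] =
    file.get(declared)

  def main(args: Array[String]): Unit = {
    // The file was written with the rewritten name `id`.
    val writtenFile = Map("id" -> Seq(0L, 1L, 2L))
    // The table was declared with column `ID`, so the lookup finds nothing.
    println(readColumn(writtenFile, "ID")) // None
    println(readColumn(writtenFile, "id")) // Some(List(0, 1, 2))
  }
}
```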


**To guarantee correctness**, we should change the output columns from
`Seq[Attribute]` to `Seq[String]` so that the optimizer cannot replace
their names.
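
The idea can be sketched in plain Scala. This is a toy model under stated assumptions, not the actual patch: `Attr`, `WriteWithAttrs`, `WriteWithNames`, `rewrite`, and `optimize` are hypothetical stand-ins for Catalyst attributes, the write command, and an alias-removing optimizer rule.

```scala
// Toy model, not Spark's classes: all names below are hypothetical stand-ins.
object OutputNamesSketch {
  final case class Attr(name: String)

  // Before: the command keeps attributes, which a rule may rewrite.
  final case class WriteWithAttrs(outputColumns: Seq[Attr])
  // After: the command keeps plain names, which no rule touches.
  final case class WriteWithNames(outputColumnNames: Seq[String])

  // Stand-in for an optimizer rule that replaces an aliased attribute
  // with the underlying one (here: `ID` becomes `id`).
  def rewrite(attrs: Seq[Attr]): Seq[Attr] =
    attrs.map(a => if (a.name == "ID") Attr("id") else a)

  def optimize(cmd: WriteWithAttrs): WriteWithAttrs =
    cmd.copy(outputColumns = rewrite(cmd.outputColumns))

  def main(args: Array[String]): Unit = {
    val analyzed = Seq(Attr("ID"))
    // Attribute-based command: the name changes during optimization.
    val before = optimize(WriteWithAttrs(analyzed))
    // Name-based command: names are captured once and stay as analyzed.
    val after = WriteWithNames(analyzed.map(_.name))
    println(before.outputColumns.map(_.name)) // List(id)
    println(after.outputColumnNames)          // List(ID)
  }
}
```

Capturing the names as strings when the command is built decouples the written schema from any later attribute rewriting, which is the property this change relies on.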
I will fix the rules related to project elimination in
https://github.com/apache/spark/pull/22311 after this one.
## How was this patch tested?
Unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gengliangwang/spark fixOutputSchema
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22320.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22320
----
commit bbd572c1fe542c6b2fd642212f927ba384c882e4
Author: Gengliang Wang <gengliang.wang@...>
Date: 2018-08-31T16:07:00Z
Fix regression in FileFormatWriter output schema
----