Gengliang Wang created SPARK-25313:
--------------------------------------
Summary: Fix regression in FileFormatWriter output schema
Key: SPARK-25313
URL: https://issues.apache.org/jira/browse/SPARK-25313
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang
Let's see the following example:
val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()
The output column name in the written schema will be id instead of ID, so the
last query shows nothing from tbl2.
With debug logging enabled, we can see that the output name is changed from
ID to id, and that the outputColumns in InsertIntoHadoopFsRelationCommand are
changed by the RemoveRedundantAliases rule.
To guarantee correctness, we should change the output columns from
`Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
the optimizer.
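
The difference between the two representations can be illustrated with a toy
model. This is only a sketch and does not use Spark's real classes: `Attr`,
`optimize`, `InsertByAttributes` and `InsertByNames` are hypothetical names
made up for illustration, and `optimize` is a simplified stand-in for a rule
that replaces an aliased attribute with the child attribute it refers to.

```scala
object OutputSchemaSketch {
  // Hypothetical stand-in for an attribute: a name plus an identity.
  final case class Attr(name: String, id: Long)

  // Simplified stand-in for an alias-removing rule: each attribute is
  // replaced by the child attribute with the same id, so the name the
  // user asked for ("ID") is lost in favour of the child's name ("id").
  def optimize(attrs: Seq[Attr], child: Seq[Attr]): Seq[Attr] = {
    val byId = child.map(a => a.id -> a).toMap
    attrs.map(a => byId.getOrElse(a.id, a))
  }

  // Output columns kept as attributes: the rule can rewrite them.
  final case class InsertByAttributes(outputColumns: Seq[Attr]) {
    def optimized(child: Seq[Attr]): InsertByAttributes =
      copy(outputColumns = optimize(outputColumns, child))
    def writtenNames: Seq[String] = outputColumns.map(_.name)
  }

  // Output columns kept as plain strings: the rule has nothing to rewrite.
  final case class InsertByNames(outputColumnNames: Seq[String]) {
    def optimized(child: Seq[Attr]): InsertByNames = this
    def writtenNames: Seq[String] = outputColumnNames
  }

  val child: Seq[Attr] = Seq(Attr("id", 1L))     // column as stored in tbl
  val requested: Seq[Attr] = Seq(Attr("ID", 1L)) // column as declared in tbl2

  val before: Seq[String] =
    InsertByAttributes(requested).optimized(child).writtenNames
  val after: Seq[String] =
    InsertByNames(requested.map(_.name)).optimized(child).writtenNames

  def main(args: Array[String]): Unit = {
    println(before.mkString(",")) // id
    println(after.mkString(","))  // ID
  }
}
```

In this model the attribute-based command ends up writing `id`, while the
string-based command keeps `ID` regardless of what the rule does to the query.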