[
https://issues.apache.org/jira/browse/SPARK-14566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237690#comment-15237690
]
Cheng Lian commented on SPARK-14566:
------------------------------------
This bug is exposed after fixing SPARK-14458.
Together, these two bugs happened to cheat all our existing test cases.
> When appending to partitioned persisted table, we should apply a projection
> over input query plan using existing metastore schema
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-14566
> URL: https://issues.apache.org/jira/browse/SPARK-14566
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
>
> Take the following snippets slightly modified from test case
> "SQLQuerySuite.SPARK-11453: append data to partitioned table" as an example:
> {code}
> val df1 = Seq("1" -> "10", "2" -> "20").toDF("i", "j")
> df1.write.partitionBy("i").saveAsTable("tbl11453")
> val df2 = Seq("3" -> "30").toDF("i", "j")
> df2.write.mode(SaveMode.Append).partitionBy("i").saveAsTable("tbl11453")
> {code}
> Although {{df1.schema}} is {{<i:STRING, j:STRING>}}, schema of persisted
> table {{tbl11453}} is actually {{<j:STRING, i:STRING>}} because {{i}} is a
> partition column, which is always appended after all data columns. Thus, when
> appending {{df2}}, schemata of {{df2}} and persisted table {{tbl11453}} are
> actually different.
> In current master branch, {{CreateMetastoreDataSourceAsSelect}} simply
> applies existing metastore schema to the input query plan ([see
> here|https://github.com/apache/spark/blob/75e05a5a964c9585dd09a2ef6178881929bab1f1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala#L225]),
> which is wrong. A projection should be used instead to adjust column order
> here.
> In branch-1.6, [this projection is added in
> {{InsertIntoHadoopFsRelation}}|https://github.com/apache/spark/blob/663a492f0651d757ea8e5aeb42107e2ece429613/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L99-L104],
> but was removed in Spark 2.0. Replacing the aforementioned line in
> {{CreateMetastoreDataSourceAsSelect}} with a projection would be
> preferable.
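The column-ordering behavior described above can be sketched in plain Scala. This is an illustrative model only, not Spark's internal API: `persistedOrder` mimics how partition columns are appended after all data columns when the table is persisted, and `projectRow` models the proposed fix of projecting input rows into the existing metastore column order by name rather than relying on positional order.

```scala
// Illustrative sketch only: these helpers are not Spark internals, just a
// model of the column-ordering behavior described in the issue.

// Partition columns are appended after all data columns when the table is
// persisted, so <i, j> partitioned by i is stored as <j, i>.
def persistedOrder(dataFrameCols: Seq[String], partitionCols: Seq[String]): Seq[String] = {
  val (part, data) = dataFrameCols.partition(partitionCols.contains)
  data ++ part
}

// The proposed fix, modeled on a row as a name -> value map: project each
// input row into the existing metastore column order by name, instead of
// blindly applying the metastore schema to the input's positional order.
def projectRow(row: Map[String, String], metastoreCols: Seq[String]): Seq[String] =
  metastoreCols.map(row)

val tableCols = persistedOrder(Seq("i", "j"), Seq("i"))
println(tableCols)                                           // List(j, i)
println(projectRow(Map("i" -> "3", "j" -> "30"), tableCols)) // List(30, 3)
```

Under this model, appending {{df2}} without the projection would write the value of {{i}} into the {{j}} column and vice versa, which is exactly the silent corruption the projection avoids.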
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]