[
https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan Blue resolved SPARK-14543.
-------------------------------
Resolution: Later
This is addressed by SPARK-24251 for DataSourceV2 writers.
> SQL/Hive insertInto has unexpected results
> ------------------------------------------
>
> Key: SPARK-14543
> URL: https://issues.apache.org/jira/browse/SPARK-14543
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Ryan Blue
> Assignee: Ryan Blue
> Priority: Major
>
> *Updated description*
> There should be an option to match input data to output columns by name. The
> API allows operations on tables, which hide the column resolution problem.
> It's easy to copy from one table to another without listing the columns, and
> in the API it is common to work with columns by name rather than by position.
> I think the API should add a way to match columns by name, which is closer to
> what users expect. I propose adding something like this:
> {code}
> CREATE TABLE src (id bigint, count int, total bigint)
> CREATE TABLE dst (id bigint, total bigint, count int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
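> As an illustrative sketch (plain Python, not the Spark API; column names taken
> from the src/dst schemas above): positional matching pairs {{count}} with
> {{total}}, while matching by name pairs each column with itself:

```python
# Hypothetical sketch of the two matching strategies (not Spark code).
src_cols = ["id", "count", "total"]   # incoming data columns
dst_cols = ["id", "total", "count"]   # destination table columns

# By position (current behavior): count -> total, total -> count.
by_position = dict(zip(src_cols, dst_cols))

# By name (the proposed byName behavior): each column maps to itself.
by_name = {c: c for c in src_cols if c in dst_cols}

print(by_position)  # {'id': 'id', 'count': 'total', 'total': 'count'}
print(by_name)      # {'id': 'id', 'count': 'count', 'total': 'total'}
```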
> *Original description*
> The Hive write path adds a pre-insertion cast (projection) to reconcile
> incoming data columns with the outgoing table schema. Columns are matched by
> position and casts are inserted to reconcile the two column schemas.
> When columns aren't correctly aligned, this causes unexpected results. I ran
> into this by omitting a correct {{partitionBy}} call (addressed by
> SPARK-14459), which produced an error saying an int could not be cast to an
> array. However, if the columns are only vaguely compatible, for example string
> and float, then no error or warning is produced and the data is silently
> written to the wrong columns via unexpected casts (string -> bigint -> float).
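> A minimal sketch of the silent miscast, with plain Python standing in for the
> inserted casts (the values and column types here are made up for illustration):

```python
# Hypothetical example: the incoming row is (string, float), but the table
# expects (float, string). Positional matching inserts casts instead of
# failing, so both values land in the wrong columns without any error.
incoming = ("123", 4.5)   # incoming columns: (string, float)

# Cast chain string -> int -> float on the first column, and
# float -> string on the second: no error is raised, but the data is wrong.
written = (float(int(incoming[0])), str(incoming[1]))
print(written)  # (123.0, '4.5')
```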
> A real-world use case that will hit this is when a table definition changes
> by adding a column in the middle of a table. Spark SQL statements that copied
> from that table to a destination table will then map the columns differently
> but insert casts that mask the problem. The last column's data will be
> dropped without a reliable warning for the user.
> This highlights a few problems:
> * Too many or too few incoming data columns should cause an AnalysisException
> to be thrown
> * Only "safe" casts should be inserted automatically, like int -> long, using
> UpCast
> * Pre-insertion casts currently ignore extra columns by using zip
> * The pre-insertion cast logic differs between Hive's MetastoreRelation and
> LogicalRelation
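> The zip problem is easy to reproduce in miniature (plain Python here, not the
> actual pre-insertion cast code):

```python
# zip stops at the shorter sequence, so an extra incoming column is
# silently dropped rather than triggering an analysis error.
incoming_cols = ["id", "count", "total", "extra"]
table_cols = ["id", "count", "total"]

matched = list(zip(incoming_cols, table_cols))
print(matched)       # [('id', 'id'), ('count', 'count'), ('total', 'total')]
print(len(matched))  # 3 -- "extra" vanished without a warning
```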
> Also, I think there should be an option to match input data to output columns
> by name. The API allows operations on tables, which hide the column
> resolution problem. It's easy to copy from one table to another without
> listing the columns, and in the API it is common to work with columns by name
> rather than by position. I think the API should add a way to match columns by
> name, which is closer to what users expect. I propose adding something like
> this:
> {code}
> CREATE TABLE src (id bigint, count int, total bigint)
> CREATE TABLE dst (id bigint, total bigint, count int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]