[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

Xiaoju Wu (JIRA) Thu, 22 Feb 2018 23:37:22 -0800

    [ https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374040#comment-16374040 ]


Xiaoju Wu commented on SPARK-9278:
----------------------------------

The issue still seems to exist. Here is the test:

import spark.implicits._

val data = Seq(
  (7, "test1", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)

// Create a table partitioned by col1, the first column of the DataFrame.
val table = "default.tbl"
spark
  .createDataset(data)
  .toDF("col1", "col2", "col3")
  .write
  .partitionBy("col1")
  .saveAsTable(table)

val data2 = Seq(
  (7, "test2", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)

// Insert more rows in the same column order, without partitionBy.
spark
  .createDataset(data2)
  .toDF("col1", "col2", "col3")
  .write
  .insertInto(table)

spark.sql("select * from " + table).show()

+---------+----+----+
|     col2|col3|col1|
+---------+----+----+
|test#test| 0.0|   8|
|    test1| 1.0|   7|
|    test3| 0.0|   9|
|        8|null|   0|
|        9|null|   0|
|        7|null|   1|
+---------+----+----+

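No exception was thrown because I called insertInto on its own, without
partitionBy, but the data is inserted incorrectly. The issue is related to
column order: the table lists the partition column col1 last (col2, col3,
col1), while the inserted rows are in (col1, col2, col3) order. If I partition
by col3 instead, which is already the last column, it works: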

import spark.implicits._

val data = Seq(
  (7, "test1", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)

val table = "default.tbl"
spark
  .createDataset(data)
  .toDF("col1", "col2", "col3")
  .write
  .partitionBy("col3")
  .saveAsTable(table)

val data2 = Seq(
  (7, "test2", 1.0),
  (8, "test#test", 0.0),
  (9, "test3", 0.0)
)

spark
  .createDataset(data2)
  .toDF("col1", "col2", "col3")
  .write
  .insertInto(table)

sql("select * from " + table).show()

+----+---------+----+
|col1|     col2|col3|
+----+---------+----+
|   8|test#test| 0.0|
|   9|    test3| 0.0|
|   8|test#test| 0.0|
|   9|    test3| 0.0|
|   7|    test1| 1.0|
|   7|    test2| 1.0|
+----+---------+----+
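
For anyone hitting the same thing, here is a rough sketch of the reordering
workaround mentioned in the issue description. It relies on insertInto matching
columns by position rather than by name, so it selects the DataFrame's columns
in the table's own order before inserting. The helper name alignToTable, the
object name, and the table name tbl_aligned are made up for illustration, and I
am assuming a local SparkSession with the default catalog; this is not an
official fix.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object InsertIntoAligned {
  // Hypothetical helper: reorder df's columns to match the target table's
  // column order, because insertInto resolves columns by position.
  def alignToTable(spark: SparkSession, df: DataFrame, table: String): DataFrame =
    df.select(spark.table(table).columns.map(col): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to reproduce the behaviour.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("insertInto-column-order")
      .getOrCreate()
    import spark.implicits._

    val table = "tbl_aligned" // hypothetical table name

    // Partitioning by col1 moves col1 to the end of the table schema.
    Seq((7, "test1", 1.0), (8, "test#test", 0.0), (9, "test3", 0.0))
      .toDF("col1", "col2", "col3")
      .write
      .mode("overwrite")
      .partitionBy("col1")
      .saveAsTable(table)

    val data2 = Seq((7, "test2", 1.0), (8, "test#test", 0.0), (9, "test3", 0.0))
      .toDF("col1", "col2", "col3")

    // Reorder to the table's column order, then insert by position.
    alignToTable(spark, data2, table).write.insertInto(table)

    spark.table(table).show()
    spark.stop()
  }
}
```

The helper reads the column order from the table itself instead of hard-coding
it, so it should keep working if the partition column changes.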

> DataFrameWriter.insertInto inserts incorrect data
> -------------------------------------------------
>
>                 Key: SPARK-9278
>                 URL: https://issues.apache.org/jira/browse/SPARK-9278
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: Linux, S3, Hive Metastore
>            Reporter: Steve Lindemann
>            Assignee: Cheng Lian
>            Priority: Critical
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.


