[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

Hyukjin Kwon (JIRA) Tue, 08 Dec 2015 17:21:51 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047830#comment-15047830
 ]


Hyukjin Kwon commented on SPARK-9278:
-------------------------------------

The result may well be different in your case, since I ran the code below on 
the master branch of Spark in a local environment (no S3), using the Scala API 
on Mac OS. Still, I will leave a note of what I tested in case you want to 
test outside the environment described in the issue.

Here is the code I ran:

{code}
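  // Imports for Row and the schema types. sc and sqlContext are assumed to be
  // already defined (e.g. in spark-shell).
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  // Create data.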
  val alphabets = Seq("a", "e", "i", "o", "u")
  val partA = (0 to 4).map(i => Seq(alphabets(i % 5), "a", i))
  val partB = (5 to 9).map(i => Seq(alphabets(i % 5), "b", i))
  val partC = (10 to 14).map(i => Seq(alphabets(i % 5), "c", i))
  val data = partA ++ partB ++ partC

  // Create RDD.
  val rowsRDD = sc.parallelize(data.map(Row.fromSeq))

  // Create DataFrame.
  val schema = StructType(List(
    StructField("k", StringType, true),
    StructField("pk", StringType, true),
    StructField("v", IntegerType, true))
  )
  val sdf = sqlContext.createDataFrame(rowsRDD, schema)

  // Create an empty table.
  sdf.filter("FALSE")
    .write
    .format("parquet")
    .option("path", "foo")
    .partitionBy("pk")
    .saveAsTable("foo")

  // Save a partitioned table.
  sdf.filter("pk = 'a'")
    .write
    .partitionBy("pk")
    .insertInto("foo")

  // Select all.
  val foo = sqlContext.table("foo")
  foo.show()
{code} 

The result was correct, as shown below:

{code}
+---+---+---+
|  k|  v| pk|
+---+---+---+
|  a|  0|  a|
|  e|  1|  a|
|  i|  2|  a|
|  o|  3|  a|
|  u|  4|  a|
+---+---+---+
{code}
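
The issue description quoted below notes that reordering the columns in the 
data to be written seems to avoid the problem. In the output above the 
partition column pk comes last (k, v, pk), whereas sdf was built as (k, pk, v). 
A minimal sketch of that workaround, selecting the columns in the order of the 
target table before writing, might look like the following. The helper name 
alignToTable is hypothetical (it is not part of Spark), and I have not 
verified it in the reporter's environment.

{code}
  import org.apache.spark.sql.{DataFrame, SQLContext}

  // Hypothetical helper: returns df with its columns selected in the same
  // order as the columns of the existing table. The select fails if df is
  // missing any of the table's columns.
  def alignToTable(sqlContext: SQLContext, df: DataFrame, table: String): DataFrame = {
    val targetColumns = sqlContext.table(table).columns
    df.select(targetColumns.head, targetColumns.tail: _*)
  }
{code}

With the data above, it could be used as 
alignToTable(sqlContext, sdf.filter("pk = 'b'"), "foo") followed by the same 
.write.partitionBy("pk").insertInto("foo") call as before.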

> DataFrameWriter.insertInto inserts incorrect data
> -------------------------------------------------
>
>                 Key: SPARK-9278
>                 URL: https://issues.apache.org/jira/browse/SPARK-9278
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: Linux, S3, Hive Metastore
>            Reporter: Steve Lindemann
>            Assignee: Cheng Lian
>            Priority: Blocker
>
> After creating a partitioned Hive table (stored as Parquet) via the 
> DataFrameWriter.createTable command, subsequent attempts to insert additional 
> data into new partitions of this table result in inserting incorrect data 
> rows. Reordering the columns in the data to be written seems to avoid this 
> issue.



