[ https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luis updated SPARK-21698:
-------------------------
    Description:

Spark's partitionBy() is causing some data corruption. I am doing three very simple writes. Below is the code to reproduce the problem.

{code:title=Program Output|borderStyle=solid}
17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
/usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
+---+----+-----+
| id|name|count|
+---+----+-----+
|  1|   1|    1|
|  2|   2|    2|
|  3|   3|    3|
+---+----+-----+

17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
17/08/10 16:05:07 WARN log: Updated size to 545
+---+----+-----+
| id|name|count|
+---+----+-----+
|  1|   1|    1|
|  2|   2|    2|
|  3|   3|    3|
|  4|   4|    4|
|  5|   5|    5|
|  6|   6|    6|
+---+----+-----+

+---+----+-----+
| id|name|count|
+---+----+-----+
|  1|   1|    1|
|  2|   2|    2|
|  3|   3|    3|
|  4|   4|    4|
|  4|   4|    4|
|  5|   5|    5|
|  5|   5|    5|
|  6|   6|    6|
|  6|   6|    6|
+---+----+-----+
{code}

In the last show(), the data isn't what I would expect: rows 4, 5, and 6 each appear twice.

{code:title=spark init|borderStyle=solid}
self.spark = SparkSession \
    .builder \
    .master("spark://localhost:7077") \
    .enableHiveSupport() \
    .getOrCreate()
{code}

{code:title=Code for the test case|borderStyle=solid}
def test_clean_insert_table(self):
    table_name = "data"
    data0 = [
        {"id": 1, "name": "1", "count": 1},
        {"id": 2, "name": "2", "count": 2},
        {"id": 3, "name": "3", "count": 3},
    ]
    df_data0 = self.spark.createDataFrame(data0)
    df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
    df_return = self.spark.read.table(table_name)
    df_return.show()

    data1 = [
        {"id": 4, "name": "4", "count": 4},
        {"id": 5, "name": "5", "count": 5},
        {"id": 6, "name": "6", "count": 6},
    ]
    df_data1 = self.spark.createDataFrame(data1)
    df_data1.write.insertInto(table_name)
    df_return = self.spark.read.table(table_name)
    df_return.show()

    data3 = [
        {"id": 1, "name": "one", "count": 7},
        {"id": 2, "name": "two", "count": 8},
        {"id": 4, "name": "three", "count": 9},
        {"id": 6, "name": "six", "count": 10}
    ]
    # Note: this builds the third DataFrame from data1, not data3, so the
    # third insert writes rows 4, 5, and 6 a second time.
    data3 = self.spark.createDataFrame(data1)
    data3.write.insertInto(table_name)
    df_return = self.spark.read.table(table_name)
    df_return.show()
{code}
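Two details of this repro may be worth checking (this is an editorial sketch, not part of the original report): createDataFrame() on a list of dicts infers the schema with the keys sorted alphabetically (hence the UserWarning in the output above), saveAsTable() with partitionBy("count") stores the partition column last in the table schema, and insertInto() resolves columns by position rather than by name. The snippet below reuses self.spark and table_name from the test case, plus an illustrative df_check, to print both orderings:

{code:title=Column-order check (editorial sketch)|borderStyle=solid}
# Sketch only: compare the column order Spark infers from dicts with the
# column order of the partitioned table. insertInto() matches columns by
# position, so a mismatch here would silently shuffle values across columns.
df_check = self.spark.createDataFrame([{"id": 4, "name": "4", "count": 4}])
print(df_check.columns)                           # ['count', 'id', 'name'] -- dict keys sorted
print(self.spark.read.table(table_name).columns)  # ['id', 'name', 'count'] -- partition column last
{code}

Every row in data1 happens to carry the same value in all three columns ("4", 4, 4, and so on), so a positional shuffle would not be visible in the output shown above.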
> write.partitionBy() is giving me garbage data
> ---------------------------------------------
>
>                 Key: SPARK-21698
>                 URL: https://issues.apache.org/jira/browse/SPARK-21698
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.1, 2.2.0
>         Environment: Linux Ubuntu 17.04. Python 3.5.
>            Reporter: Luis
>
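If insertInto()'s position-based column resolution turns out to be the culprit, a common workaround is to select the DataFrame's columns in the table's order before inserting. A minimal sketch, reusing self.spark, df_data1, and table_name from the test case above:

{code:title=Workaround sketch (assumes positional insertInto)|borderStyle=solid}
# Sketch only: align the DataFrame's column order with the table schema
# before the positional insertInto(), so values land in the right columns.
table_cols = self.spark.read.table(table_name).columns   # ['id', 'name', 'count']
df_data1.select(*table_cols).write.insertInto(table_name)
{code}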
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org