Chenyu Zheng created SPARK-53937:
------------------------------------

             Summary: SparkSQL partition overwrite is not an atomic operation.
                 Key: SPARK-53937
                 URL: https://issues.apache.org/jira/browse/SPARK-53937
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.1
            Reporter: Chenyu Zheng


I found that SparkSQL partition overwrite is not an atomic operation. When a 
SparkSQL application overwrites an existing table partition, it first [deletes 
the matching 
partition](https://github.com/apache/spark/blob/24a6abf34d253162055c8b9bd0030bf9a2ca75b1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L136)
 and only then runs the write job.
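
A minimal reproduction sketch (the database and table names `db.events` and 
`db.events_staging` are hypothetical; any Hive Parquet or ORC table with a 
partition column should behave the same):

```sql
-- Assumes spark.sql.hive.convertMetastoreParquet=true (the default), so the
-- write is converted to the datasource path (InsertIntoHadoopFsRelationCommand)
-- instead of the Hive SerDe path.
CREATE TABLE db.events (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- The directory of the matching partition is deleted from the filesystem
-- before the rewrite job starts, so the overwrite is not atomic.
INSERT OVERWRITE TABLE db.events PARTITION (dt = '2024-01-01')
SELECT id, payload FROM db.events_staging WHERE dt = '2024-01-01';
```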

While the job's tasks are executing, the partition still exists in the Hive 
metastore, but its data has already been deleted from the filesystem. A query 
that reads the partition during this window returns empty results, and if the 
job fails or is interrupted, the original data is lost.
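
For example, a concurrent reader in a second session hits this window 
(continuing the hypothetical table above):

```sql
-- Run while the INSERT OVERWRITE above is still executing: the partition is
-- still registered in the Hive metastore, but its directory has already been
-- removed, so this returns 0 rows instead of the old data.
SELECT COUNT(*) FROM db.events WHERE dt = '2024-01-01';
```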

Note: this only happens when spark.sql.hive.convertMetastoreOrc or 
spark.sql.hive.convertMetastoreParquet is true. It is not a problem for the 
Hive SerDe write path.
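
As a workaround sketch, disabling the conversion forces the write through the 
Hive SerDe path, which stages the output and swaps it in at commit time:

```sql
-- With conversion disabled, Spark uses the Hive SerDe write path
-- (InsertIntoHiveTable), which writes to a staging directory and only
-- replaces the partition contents when the job commits.
SET spark.sql.hive.convertMetastoreParquet=false;
SET spark.sql.hive.convertMetastoreOrc=false;
```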


