[ https://issues.apache.org/jira/browse/SPARK-53937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenyu Zheng updated SPARK-53937:
---------------------------------
    Description: 
I found that SparkSQL partition overwrite is not an atomic operation. When a
SparkSQL application overwrites an existing table partition, it first [deletes
the matching partitions](https://github.com/apache/spark/blob/24a6abf34d253162055c8b9bd0030bf9a2ca75b1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L136)
and only then runs the write job.

While the job is executing, the Hive partition still exists in the metastore,
but its data has already been deleted from the filesystem. A concurrent reader
therefore sees an empty partition, and if the job is interrupted the old data
is lost.
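
A minimal repro sketch of the window (the table names db.events and db.staging
are hypothetical; this assumes a Hive Parquet table with
spark.sql.hive.convertMetastoreParquet left at its default of true):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical repro sketch; db.events and db.staging are made-up names.
val spark = SparkSession.builder()
  .appName("partition-overwrite-window")
  .enableHiveSupport()
  .getOrCreate()

// A Hive Parquet table, so the insert takes the converted
// (InsertIntoHadoopFsRelationCommand) path when
// spark.sql.hive.convertMetastoreParquet is true.
spark.sql(
  """CREATE TABLE IF NOT EXISTS db.events (id BIGINT)
    |PARTITIONED BY (dt STRING) STORED AS PARQUET""".stripMargin)

// Session A: overwrite one partition. The matching partition directory is
// deleted before the rewrite job runs.
spark.sql(
  """INSERT OVERWRITE TABLE db.events PARTITION (dt = '2024-01-01')
    |SELECT id FROM db.staging""".stripMargin)

// Session B, run while the INSERT OVERWRITE above is still in flight: the
// partition is still registered in the metastore, but its files are gone,
// so this returns an empty result. If session A is killed at this point,
// the old data is lost.
spark.sql("SELECT count(*) FROM db.events WHERE dt = '2024-01-01'").show()
```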

Note: this only happens when spark.sql.hive.convertMetastoreOrc or
spark.sql.hive.convertMetastoreParquet is true, i.e. when the insert is
converted to the data source path. It is not a problem for the Hive SerDe
path.
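
For reference, a hedged sketch of the two configurations (these conf keys are
real Spark SQL settings; whether disabling the conversion is an acceptable
workaround depends on the workload):

```scala
// Default: Parquet/ORC Hive tables are converted to the data source path
// (InsertIntoHadoopFsRelationCommand), which deletes matching partitions
// up front, opening the non-atomic window described above.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// With conversion disabled, the insert goes through the Hive SerDe path,
// which, per the report, does not have this problem.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```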


> SparkSQL partition overwrite is not an atomic operation.
> --------------------------------------------------------
>
>                 Key: SPARK-53937
>                 URL: https://issues.apache.org/jira/browse/SPARK-53937
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Chenyu Zheng
>            Priority: Major



