[
https://issues.apache.org/jira/browse/SPARK-53937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chenyu Zheng updated SPARK-53937:
---------------------------------
Description:
I found that SparkSQL partition overwrite is not an atomic operation. When a
SparkSQL application overwrites an existing table partition, it first [deletes
the matching partition](https://github.com/apache/spark/blob/24a6abf34d253162055c8b9bd0030bf9a2ca75b1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L136)
and then runs the write job.
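
A minimal sketch of a statement that exercises this path (database, table, and
partition names are hypothetical); with the default static
partition-overwrite mode, the matching partition directory is removed before
the write job starts:

{code:scala}
// Hypothetical table and partition; a Parquet/ORC-backed Hive table written
// with INSERT OVERWRITE ... PARTITION goes through this code path when
// spark.sql.hive.convertMetastoreParquet / convertMetastoreOrc is true.
spark.sql(
  """INSERT OVERWRITE TABLE db.events PARTITION (dt = '2024-01-01')
    |SELECT * FROM staging_events""".stripMargin)
{code}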
During task execution the Hive partition still exists in the metastore, but
its data in the filesystem has already been deleted. If you read the partition
while the job is running, you will read empty data; if the job is interrupted,
the old data is lost.
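
For illustration, a concurrent reader in a second session would observe the
window like this (a sketch, using the hypothetical table above):

{code:scala}
// Run in a second session while the INSERT OVERWRITE job is still executing.
// The metastore still lists partition dt='2024-01-01', but its directory has
// already been deleted, so the count returns 0 instead of the old row count.
spark.sql("SELECT count(*) FROM db.events WHERE dt = '2024-01-01'").show()
{code}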
Note: this only happens when spark.sql.hive.convertMetastoreOrc or
spark.sql.hive.convertMetastoreParquet is true. It is not a problem for Hive
SerDe tables.
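
Possible mitigations, sketched under the assumptions above (not verified
against this ticket):

{code:scala}
// 1. Fall back to the Hive SerDe write path, which this report says is not
//    affected, by disabling the metastore table conversion:
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

// 2. Dynamic partition overwrite writes the new files first and replaces the
//    matching partitions at commit time, avoiding the upfront delete:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
{code}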
> SparkSQL partition overwrite is not an atomic operation.
> --------------------------------------------------------
>
> Key: SPARK-53937
> URL: https://issues.apache.org/jira/browse/SPARK-53937
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Chenyu Zheng
> Priority: Major
>
> I found that SparkSQL partition overwrite is not an atomic operation. When a
> SparkSQL application overwrites an existing table partition, it first [deletes
> the matching partition](https://github.com/apache/spark/blob/24a6abf34d253162055c8b9bd0030bf9a2ca75b1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L136)
> and then runs the write job.
> During task execution the Hive partition still exists in the metastore, but
> its data in the filesystem has already been deleted. If you read the partition
> while the job is running, you will read empty data; if the job is interrupted,
> the old data is lost.
> Note: this only happens when spark.sql.hive.convertMetastoreOrc or
> spark.sql.hive.convertMetastoreParquet is true. It is not a problem for Hive
> SerDe tables.