[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729193#comment-17729193 ]
Vaibhav Beriwala commented on SPARK-43106:
------------------------------------------
[~dongjoon] Did you get a chance to take a look at this?
Any feedback from you would be really helpful. As [~itskals] mentioned, we at
Uber are willing to work on this; we just want to understand any gotchas that
explain why the idea proposed in SPARK-19183 was not taken forward.
> Data lost from the table if the INSERT OVERWRITE query fails
> ------------------------------------------------------------
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Vaibhav Beriwala
> Priority: Major
>
> When we run an INSERT OVERWRITE query on an unpartitioned table in Spark 3,
> Spark does the following:
> 1) It first deletes all existing data from the table path.
> 2) It then launches the job that performs the actual insert.
>
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the
> original table data is lost.
> 2) If the insert job in step 2 above takes a long time to complete, the
> table data is unavailable to other readers for the entire duration of the
> job.
>
> The behavior is the same for partitioned tables when using static
> partitioning. For dynamic partitioning, we do not delete the table data
> before the job launch.
>
> Is there a reason why we perform this delete before the job launch and not
> as part of the job commit operation? Hive does not have this issue,
> probably because it cleans up the data as part of the job commit operation.
> As part of SPARK-19183, we added a new hook to the commit protocol for
> exactly this purpose, but it seems that its default behavior is still to
> delete the table data before the job launch.
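
To make the two orderings discussed in the description concrete, here is a minimal sketch that simulates them on a local directory. It is not Spark code and does not use the Spark commit protocol; every name in it (`failing_job`, `overwrite_delete_first`, `overwrite_at_commit`, `make_table`, `run_demo`, the `.staging` and `.old` suffixes) is hypothetical and chosen only for illustration. The assumption is that a "table" is a directory of part files and that the insert job fails after writing some output.

```python
import shutil
import tempfile
from pathlib import Path


def failing_job(out_dir: Path) -> None:
    """Hypothetical insert job: writes one part file, then fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "part-new").write_text("new rows")
    raise RuntimeError("simulated task failure")


def overwrite_delete_first(table: Path, job) -> None:
    """Ordering from the report: clear the table path, then run the job."""
    shutil.rmtree(table)  # step 1: existing data is gone from here on
    table.mkdir()
    job(table)            # step 2: a failure here cannot bring the old data back


def overwrite_at_commit(table: Path, job) -> None:
    """Alternative ordering: write to a staging directory, replace on success."""
    staging = table.with_name(table.name + ".staging")
    backup = table.with_name(table.name + ".old")
    try:
        job(staging)      # readers still see the old data while the job runs
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)  # abort: drop partial output
        raise
    table.rename(backup)  # commit: move the old data aside ...
    staging.rename(table) # ... publish the new data ...
    shutil.rmtree(backup) # ... and only now delete the old data


def make_table(root: Path, name: str) -> Path:
    """Create a table directory holding one pre-existing part file."""
    table = root / name
    table.mkdir()
    (table / "part-old").write_text("old rows")
    return table


def run_demo() -> dict:
    """Run both orderings against the failing job; report surviving files."""
    result = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, overwrite in [("delete_first", overwrite_delete_first),
                                ("at_commit", overwrite_at_commit)]:
            table = make_table(root, name)
            try:
                overwrite(table, failing_job)
            except RuntimeError:
                pass  # the job failure is expected in this demo
            result[name] = sorted(p.name for p in table.iterdir())
    return result


if __name__ == "__main__":
    print(run_demo())
```

With the delete-first ordering, the failed job leaves only the partial `part-new` file and the original `part-old` is gone. With the commit-time ordering, the table still contains `part-old`. The two renames in this sketch are not atomic as a pair, so a real implementation would still need to handle readers that arrive between them.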
--
This message was sent by Atlassian Jira
(v8.20.10#820010)