[ 
https://issues.apache.org/jira/browse/SPARK-54003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-54003:
-----------------------------------
    Labels: pull-request-available  (was: )

> Use the staging directory as the output path then move to final path
> --------------------------------------------------------------------
>
>                 Key: SPARK-54003
>                 URL: https://issues.apache.org/jira/browse/SPARK-54003
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Chenyu Zheng
>            Priority: Major
>              Labels: pull-request-available
>
> SparkSQL uses the partition location or table location as the commit path 
> (except in *_dynamic partition overwrite_* mode and *_custom partition path_* 
> mode). This has at least the following issues:
> * As described in SPARK-37210, conflicts can occur when multiple jobs that 
> write partitions of the same table run concurrently. Using a staging 
> directory can avoid this issue.
> * As described in SPARK-53937, using a staging directory allows for 
> near-atomic operations.
> *_Dynamic partition overwrite_* mode and *_custom partition path_* mode 
> already use a staging directory, but the two are implemented differently 
> and could be simplified into a unified process. In addition, 
> https://github.com/apache/spark/pull/29000 sets the staging directory as 
> the output directory of FileOutputCommitter, which is safer. The commit 
> path handling should be changed to follow this approach.
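
For illustration only, the general "write to a staging directory, then move to the final path" pattern from the description can be sketched in plain Python on a local filesystem. This is not Spark's commit protocol or FileOutputCommitter. The function name write_via_staging and the ".staging-" naming scheme are hypothetical, and the delete-then-rename step is only near-atomic because a short window exists between the two calls.

```python
import os
import shutil
import uuid


def write_via_staging(final_dir, files):
    """Write `files` (name -> bytes) to a staging dir, then move it to final_dir.

    Hypothetical helper for illustration; not part of Spark.
    """
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Assumed naming scheme. The staging directory sits next to the final
    # path so that the rename stays on the same filesystem.
    staging = os.path.join(parent, ".staging-" + uuid.uuid4().hex)
    os.makedirs(staging)
    try:
        # All task output goes to the staging directory only.
        for name, data in files.items():
            with open(os.path.join(staging, name), "wb") as f:
                f.write(data)
        # Commit step: drop the old output, then rename staging into place.
        if os.path.exists(final_dir):
            shutil.rmtree(final_dir)
        os.rename(staging, final_dir)
    except BaseException:
        # On failure, discard the staged files. If the failure happened
        # while writing, the final path has not been touched.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final_dir
```

Because each call uses its own uniquely named staging directory, two concurrent writers never share in-progress files, and a writer that fails before the commit step leaves the previous output in place.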



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
