Chenyu Zheng created SPARK-54003:
------------------------------------

             Summary: Use the staging directory as the output directory before 
job commit
                 Key: SPARK-54003
                 URL: https://issues.apache.org/jira/browse/SPARK-54003
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.0
            Reporter: Chenyu Zheng


Spark SQL uses the partition location or table location as the commit path 
(except in *_dynamic partition overwrite_* mode and *_custom partition path_* 
mode). This has at least the following issues:

* As described in SPARK-37210, conflicts can occur when multiple jobs that 
write to partitions of the same table run concurrently. Using a staging 
directory avoids this issue.
* As described in SPARK-53937, using a staging directory allows for near-atomic 
operations.

*_Dynamic partition overwrite_* mode and *_custom partition path_* mode already 
use a staging directory, but the two are implemented differently and could be 
simplified into a single unified process. In addition, 
https://github.com/apache/spark/pull/29000 resets the output directory of 
FileOutputCommitter to the staging directory. This approach is safer, and the 
remaining write paths should be changed to work the same way.
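
To illustrate the general idea only, here is a minimal local-filesystem sketch 
of a staging-directory commit. It is not Spark's implementation: the 
".staging-<job id>" layout and the helper names (staging_dir, 
write_task_output, commit_job, abort_job) are hypothetical and chosen for 
illustration. Each job writes only under its own staging directory, and files 
reach the final partition location only when the job commits.

```python
import os
import shutil
import tempfile


def staging_dir(table_location: str, job_id: str) -> str:
    # Hypothetical layout: one hidden staging directory per job under the table.
    return os.path.join(table_location, f".staging-{job_id}")


def write_task_output(table_location, job_id, partition, file_name, data):
    # Tasks write only under the job's staging directory, never to the final path.
    part_dir = os.path.join(staging_dir(table_location, job_id), partition)
    os.makedirs(part_dir, exist_ok=True)
    path = os.path.join(part_dir, file_name)
    with open(path, "w") as f:
        f.write(data)
    return path


def commit_job(table_location, job_id):
    # Job commit: move every staged file to its final partition directory,
    # then remove the (now empty) staging tree.
    stage = staging_dir(table_location, job_id)
    for root, _dirs, files in os.walk(stage):
        rel = os.path.relpath(root, stage)
        dest_dir = os.path.normpath(os.path.join(table_location, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for name in files:
            os.replace(os.path.join(root, name), os.path.join(dest_dir, name))
    shutil.rmtree(stage, ignore_errors=True)


def abort_job(table_location, job_id):
    # Job abort: only this job's staging directory is deleted, so other
    # concurrent jobs and the final locations are left untouched.
    shutil.rmtree(staging_dir(table_location, job_id), ignore_errors=True)


if __name__ == "__main__":
    table = tempfile.mkdtemp()
    write_task_output(table, "job-a", "dt=2024-01-01", "part-0.txt", "a")
    write_task_output(table, "job-b", "dt=2024-01-02", "part-0.txt", "b")
    abort_job(table, "job-b")   # does not disturb job-a's staged files
    commit_job(table, "job-a")  # files become visible only at this point
    print(sorted(os.listdir(table)))
```

Because every job owns a separate staging directory in this sketch, one job's 
cleanup cannot remove another job's in-flight files. The commit here is a 
sequence of renames, which is why it is only near-atomic rather than atomic.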



