[PR] [SPARK-54003][SQL] Use the staging directory as the output path then move to final path. [spark]

via GitHub Fri, 24 Oct 2025 00:18:18 -0700


zhengchenyu opened a new pull request, #52720:
URL: https://github.com/apache/spark/pull/52720


   ### What changes were proposed in this pull request?
   
   The key modifications are as follows: 
   
   * Always use a staging directory in `SQLHadoopMapReduceCommitProtocol` and 
rename the staged output from the source directory to the destination directory 
during the commit job phase. `Dynamic partition overwrite` and `custom partition 
path` have also been integrated into this process.
   > Note: `SQLHadoopMapReduceCommitProtocol` performs directory-based renames, 
because table operations are generally directory-based. For 
`HadoopMapReduceCommitProtocol`, I believe the logic for renaming directories 
should be removed, while the logic for renaming files should be retained 
(although it is not used).
   
   * Avoid deleting partitions before tasks run, and implement dynamic overwrite 
in `SQLHadoopMapReduceCommitProtocol`. To maintain compatibility with static 
mode, the corresponding partition files need to be deleted during 
`refreshUpdatedPartitions`.
   
   * Handle paths according to `SaveMode` in `SQLHadoopMapReduceCommitProtocol`.
   
   > Note: For ease of review, some code in `HadoopMapReduceCommitProtocol` has 
been retained. In fact, I think the parameter `dynamicPartitionOverwrite` and 
the code for renaming partition directories during the commit job phase are no 
longer meaningful and should be removed.
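
As a rough illustration of the flow described above (not the code in this PR), the sketch below writes task output under a per-job staging directory and only moves it to the destination when the job commits. It uses the local filesystem and the Python standard library; the helper names (`staging_dir`, `write_task_output`, `commit_job`, `abort_job`) and the `.staging-<jobId>` naming are hypothetical and chosen for illustration only.

```python
import os
import shutil
import tempfile


def staging_dir(table_dir, job_id):
    # Hypothetical naming scheme, used only in this sketch.
    return os.path.join(table_dir, f".staging-{job_id}")


def write_task_output(table_dir, job_id, rel_path, data):
    """Task phase: write only under the job's staging directory."""
    target = os.path.join(staging_dir(table_dir, job_id), rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:
        f.write(data)


def commit_job(table_dir, job_id):
    """Commit phase: rename staged partition directories into the destination."""
    stage = staging_dir(table_dir, job_id)
    for name in sorted(os.listdir(stage)):
        src = os.path.join(stage, name)
        dst = os.path.join(table_dir, name)
        if os.path.isdir(src) and os.path.isdir(dst):
            # Destination partition already exists: move the new files into it.
            for file_name in os.listdir(src):
                os.rename(os.path.join(src, file_name), os.path.join(dst, file_name))
        else:
            os.rename(src, dst)
    shutil.rmtree(stage)


def abort_job(table_dir, job_id):
    """Abort: drop the staging directory; the destination is never touched."""
    shutil.rmtree(staging_dir(table_dir, job_id), ignore_errors=True)


table = tempfile.mkdtemp()
write_task_output(table, "job1", "dt=2025-10-24/part-0000.txt", "a\n")
# Before the commit, nothing is visible outside the hidden staging directory.
visible_before = [n for n in os.listdir(table) if not n.startswith(".")]
commit_job(table, "job1")
visible_after = sorted(n for n in os.listdir(table) if not n.startswith("."))
```

The point of the sketch is that a failed or aborted job leaves the destination unchanged, because all writes before the commit land in the staging directory.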
   
   
   ### Why are the changes needed?
   
   SparkSQL uses the partition location or table location as the commit path 
(except in `dynamic partition overwrite` mode and `custom partition path` 
mode). This has at least the following issues:
   
   * As described in 
[SPARK-37210](https://issues.apache.org/jira/browse/SPARK-37210), conflicts can 
occur when multiple jobs that write partitions of the same table run 
concurrently. Using a staging directory can avoid this issue.
   * As described in 
[SPARK-53937](https://issues.apache.org/jira/browse/SPARK-53937), using a 
staging directory allows for near-atomic operations.
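
To make the first point concrete, here is a minimal sketch of one way a shared scratch directory can cause such conflicts, and how per-job staging directories avoid it. It is not a reproduction of SPARK-37210: the interleaving, the `_scratch` and `.staging-<job>` names, and the `surviving_outputs` helper are all assumptions made for illustration.

```python
import os
import shutil
import tempfile


def write(path, data):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(data)


def surviving_outputs(table, scratch_for):
    """Run two interleaved jobs; scratch_for(job) returns that job's scratch directory."""
    jobs = {"jobA": "dt=1", "jobB": "dt=2"}
    # Both jobs finish writing task output before either one commits.
    for job, part in jobs.items():
        write(os.path.join(scratch_for(job), part, "part-0.txt"), job)
    # jobA commits and cleans up its scratch directory first, then jobB does.
    for job, part in jobs.items():
        src = os.path.join(scratch_for(job), part)
        if os.path.isdir(src):
            os.rename(src, os.path.join(table, part))
        shutil.rmtree(scratch_for(job), ignore_errors=True)
    return sorted(n for n in os.listdir(table) if n.startswith("dt="))


# Shared scratch directory: jobA's cleanup also removes jobB's pending output.
shared = tempfile.mkdtemp()
shared_result = surviving_outputs(shared, lambda job: os.path.join(shared, "_scratch"))

# Per-job staging directories: each job cleans up only its own files.
isolated = tempfile.mkdtemp()
isolated_result = surviving_outputs(
    isolated, lambda job: os.path.join(isolated, f".staging-{job}")
)
```

In the shared case only `dt=1` survives, while with isolated staging directories both partitions reach the table directory.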
   
   `Dynamic partition overwrite` mode and `custom partition path` mode already 
use the staging directory, but they are implemented differently; the two can be 
simplified into a unified process. In addition, 
https://github.com/apache/spark/pull/29000 set the staging directory as the 
output directory of FileOutputCommitter. That approach is safer, so the commit 
protocol should be changed to follow it.
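
For readers less familiar with dynamic partition overwrite, the sketch below shows the general idea on the local filesystem: at commit time only the partitions present in the staged output are replaced, and all other partitions are left alone. The `commit_dynamic_overwrite` helper and the directory names are hypothetical, and this is an assumption-laden illustration rather than the logic in `SQLHadoopMapReduceCommitProtocol`.

```python
import os
import shutil
import tempfile


def write(path, data):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(data)


def commit_dynamic_overwrite(table, staging):
    """Replace only the partitions that the job actually wrote."""
    for part in sorted(os.listdir(staging)):
        dst = os.path.join(table, part)
        if os.path.isdir(dst):
            # Drop the stale contents of this partition only.
            shutil.rmtree(dst)
        os.rename(os.path.join(staging, part), dst)
    shutil.rmtree(staging)


table = tempfile.mkdtemp()
# Existing table state: two partitions with old data.
write(os.path.join(table, "dt=1", "old.txt"), "old")
write(os.path.join(table, "dt=2", "old.txt"), "old")

# The job writes new data for dt=2 only, under a hypothetical staging directory.
staging = os.path.join(table, ".staging-job1")
write(os.path.join(staging, "dt=2", "new.txt"), "new")

commit_dynamic_overwrite(table, staging)
dt1_files = sorted(os.listdir(os.path.join(table, "dt=1")))
dt2_files = sorted(os.listdir(os.path.join(table, "dt=2")))
```

After the commit, `dt=1` still holds its old file and `dt=2` holds only the newly written one. Nothing is deleted before the tasks have finished, which matches the intent of avoiding early partition deletion.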
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing unit tests and newly added unit tests
   


