zhengchenyu opened a new pull request, #52720: URL: https://github.com/apache/spark/pull/52720
### What changes were proposed in this pull request? The key modifications are as follows: * Force to use staging directory in `SQLHadoopMapReduceCommitProtocol` and perform a rename from the source directory to the destination directory during the commit job phase. `Dynamic partition overwrite` and `custom partition path` have also been integrated into this process. > Note: `SQLHadoopMapReduceCommitProtocol` performs directory-based renames, because table operations are generally directory-based. For `HadoopMapReduceCommitProtocol`, I believe the logic for renaming directories should be removed, while the logic for renaming files should be retained (although it is not used). * Avoid deleting partitions before task runs and implement dynamic overwrite in `SQLHadoopMapReduceCommitProtocol`. To maintain compatibility with static mode, the corresponding partition files need to be deleted during `refreshUpdatedPartitions`. * Handle paths according to `SaveMode` in `SQLHadoopMapReduceCommitProtocol`. > Note: For ease of review, some code in `HadoopMapReduceCommitProtocol` has been retained. In fact, I think the parameter `dynamicPartitionOverwrite` and the code for renaming partition directories during the commit job phase are no longer meaningful and should be removed. ### Why are the changes needed? SparkSQL uses the partition location or table location as the commit path (except in `dynamic partition overwrite` mode and `custom partition path` mode). This has at least the following issues: * As described in [SPARK-37210](https://issues.apache.org/jira/browse/SPARK-37210), conflicts can occur when multiple partitions job of the same table are run concurrently. Using a staging directory can avoid this issue. * As described in [SPARK-53937](https://issues.apache.org/jira/browse/SPARK-53937), using a staging directory allows for near-atomic operations. `Dynamic partition overwrite` mode and `custom partition path` mode already use the staging directory. And `dynamic partition overwrite` mode and `custom partition path` are implemented differently, which can be simplified into a unified process. And in https://github.com/apache/spark/pull/29000, reset the staging directory as the output directory of FileOutputCommitter. This way is more safer. It should be modified to this way. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests and newly added unit tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
