Github user fangshil commented on a diff in the pull request:
https://github.com/apache/spark/pull/20931#discussion_r179517200
--- Diff:
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
---
@@ -186,7 +186,9 @@ class HadoopMapReduceCommitProtocol(
logDebug(s"Clean up default partition directories for overwriting:
$partitionPaths")
for (part <- partitionPaths) {
val finalPartPath = new Path(path, part)
- fs.delete(finalPartPath, true)
+ if (!fs.delete(finalPartPath, true) &&
!fs.exists(finalPartPath.getParent)) {
--- End diff --
@cloud-fan this is to follow the behavior of the HDFS rename spec: it
requires the parent of the destination to be present. If we create
finalPartPath directly, rename exhibits another weird behavior when the
destination path already exists. From the HDFS spec I shared above: "If the
destination exists and is a directory, the final destination of the rename
becomes the destination + the filename of the source path". We have confirmed
this on our production cluster, which led to the current solution of creating
only the parent directory, following the HDFS spec exactly.
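To make the two rename rules being discussed concrete, here is a toy
in-memory model of them in Scala. This is not Hadoop code and not the actual
patch; the object name, the path set, and the sample paths are all invented
for illustration. It only encodes the two spec behaviors the comment relies
on: rename fails when the destination's parent is missing, and an existing
directory destination redirects the rename into itself.

```scala
import scala.collection.mutable

// Toy model (illustration only, not real Hadoop code) of two HDFS
// FileSystem.rename rules:
//   1. rename fails if the destination's parent does not exist
//   2. if the destination exists and is a directory, the source is moved
//      *into* it: final destination = dst + "/" + name(src)
object RenameSpecDemo {
  // Paths present in the pretend filesystem; directories tracked explicitly.
  val dirs = mutable.Set[String]("/", "/warehouse")
  val files = mutable.Set[String]("/staging/part-0")

  def parent(p: String): String = {
    val i = p.lastIndexOf('/')
    if (i <= 0) "/" else p.substring(0, i)
  }

  def name(p: String): String = p.substring(p.lastIndexOf('/') + 1)

  def rename(src: String, dst: String): Boolean = {
    if (!files.contains(src)) return false
    // Rule 2: an existing directory destination redirects the rename into it.
    val finalDst = if (dirs.contains(dst)) dst + "/" + name(src) else dst
    // Rule 1: the (final) destination's parent must already exist.
    if (!dirs.contains(parent(finalDst))) return false
    files -= src
    files += finalDst
    true
  }
}
```

Under this model, deleting a stale partition directory but recreating only
its parent (as the patch does) is exactly what makes a later rename land at
the intended path: the parent exists (rule 1 is satisfied) while the
destination itself does not (rule 2 never fires).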
---