symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r813039532



##########
File path: 
core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -236,7 +236,9 @@ class HadoopMapReduceCommitProtocol(
         }
       }
 
-      fs.delete(stagingDir, true)
+      if (dynamicPartitionOverwrite || filesToMove.nonEmpty) {
+        fs.delete(stagingDir, true)

Review comment:
       Thanks for the reply. I'll ask the HDFS community whether there is 
official documentation on this "check before write" operation.
   
   In my current scenario, the cons are:
   1. A warning is logged when writing to Alluxio.
   2. Since HDFS returns no log, the user won't even notice that they are 
deleting a nonexistent file, which may lead to other mistakes. For example, I 
might want to check the size of ".spark-staging-xxx", but the directory 
doesn't exist at all.
   3. The performance overhead mentioned above from the NameNode write lock.
   
   As for why fs.delete doesn't check existence on the server side, I think 
the following points might be related:
   1. The FileSystem interface is kept succinct, so "fs.delete" only deletes.
   2. The HDFS client and other clients already do the check. For example, when 
you run "hadoop fs -rmr hdfs://xxx/xxx", it checks first before actually 
calling "fs.delete". So if another "fs.exists" were added inside "fs.delete", 
clients that already check beforehand would end up with 2 * "fs.exists" + 1 * 
"fs.delete".
   3. It reminds me of the story of the king who, afraid of getting dirty feet, 
asks for all the ground in the country to be swept, when in fact the only thing 
the king needs to do is wear shoes. Similarly, some FileSystems, like Alluxio, 
do check existence before deleting, but, IMHO, we can't ask every FileSystem to 
do the same; it's better to do the check on the user's side.
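
   To make the call counting in point 2 concrete, here is a rough sketch. 
It uses a hypothetical in-memory `CountingFs` class and a hypothetical 
`deleteIfExists` helper, neither of which is part of Hadoop or Spark; it only 
models "exists" and "delete" calls so they can be counted, and it assumes each 
call would correspond to one round trip to the server.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a file system; NOT the Hadoop API.
// It only counts how many exists/delete calls are issued.
final class CountingFs(initial: Set[String], checkInsideDelete: Boolean) {
  private val paths = mutable.Set.empty[String] ++= initial
  var existsCalls = 0
  var deleteCalls = 0

  def exists(path: String): Boolean = {
    existsCalls += 1
    paths.contains(path)
  }

  def delete(path: String): Boolean = {
    // When checkInsideDelete is true, this models a file system that
    // re-checks existence itself before performing the delete.
    if (checkInsideDelete && !exists(path)) {
      false
    } else {
      deleteCalls += 1
      paths.remove(path)
    }
  }
}

object ClientSideCheck {
  // Caller-side guard: only issue delete when the path is known to exist.
  def deleteIfExists(fs: CountingFs, path: String): Boolean =
    fs.exists(path) && fs.delete(path)
}
```

   With `checkInsideDelete = false`, a guarded delete of a missing path costs 
one "exists" and no "delete". With `checkInsideDelete = true`, a caller that 
already checks pays 2 * "exists" + 1 * "delete" for a path that exists, which 
is the duplication described in point 2.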




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


