symious commented on a change in pull request #35569:
URL: https://github.com/apache/spark/pull/35569#discussion_r810921597



##########
File path: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala
##########
@@ -305,4 +304,11 @@ class HadoopMapReduceCommitProtocol(
         logWarning(s"Exception while aborting ${taskContext.getTaskAttemptID}", e)
     }
   }
+
+  private def cleanStagingDir(jobContext: JobContext): Unit = {
+    val fs = stagingDir.getFileSystem(jobContext.getConfiguration)
+    if (fs.exists(stagingDir)) {

Review comment:
       @cloud-fan Thanks for the review.
   The overhead of the newly added read RPC for the existence check is quite 
small compared to an unnecessary delete of a nonexistent file, because in the 
NameNode the read lock is shared while the write lock is exclusive.
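
To illustrate the shared-versus-exclusive point, here is a minimal sketch that uses the JDK's `ReentrantReadWriteLock` as a stand-in. It is not the NameNode's actual locking code, and the object and method names (`RwLockSketch`, `probe`) are made up for this example. While one thread holds the read lock, a second reader can still acquire it, but a writer cannot.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Hypothetical demo object; a JDK read/write lock stands in for the NameNode lock.
object RwLockSketch {

  /** Returns (secondReaderAcquired, writerAcquired), probed from a second
    * thread while the calling thread holds the read lock. */
  def probe(): (Boolean, Boolean) = {
    val lock = new ReentrantReadWriteLock()
    var secondReader = false
    var writer = false

    lock.readLock().lock() // first reader, e.g. an existence check in flight
    try {
      val t = new Thread(() => {
        // A second reader can share the lock with the first one.
        secondReader = lock.readLock().tryLock()
        if (secondReader) lock.readLock().unlock()
        // A writer (e.g. a delete) needs exclusive access and is refused.
        writer = lock.writeLock().tryLock()
        if (writer) lock.writeLock().unlock()
      })
      t.start()
      t.join() // join makes the other thread's writes visible here
    } finally {
      lock.readLock().unlock()
    }
    (secondReader, writer)
  }

  def main(args: Array[String]): Unit = {
    val (reader, writer) = probe()
    println(s"second reader acquired: $reader, writer acquired: $writer")
  }
}
```

In this model, an extra read-only call does not block other readers, whereas every delete has to wait for exclusive access even when there is nothing to delete.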
   
   Checking for existence before deleting is also common client-side practice: 
when we run "hadoop fs -rm hdfs://ns/file", existence is checked before the 
actual delete as well.
   
   In fact, I think it may be better to add a variable that records whether the 
staging directory has been created, so that we don't need the first RPC to 
check its existence at all. @cloud-fan What do you think?
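
A rough sketch of the flag idea, only to make the proposal concrete. Everything here is hypothetical: `CountingFs` is an in-memory stand-in that counts calls instead of talking to a NameNode, and `StagingCleaner`, `createStagingDir`, and `stagingDirCreated` are illustrative names, not part of `HadoopMapReduceCommitProtocol`. The sketch also assumes the directory is created by the same object that later cleans it up, which may not hold in the real protocol.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a file system client; each method counts as one RPC.
final class CountingFs {
  private val dirs = mutable.Set.empty[String]
  var rpcCalls = 0

  def mkdirs(path: String): Boolean = { rpcCalls += 1; dirs += path; true }
  def exists(path: String): Boolean = { rpcCalls += 1; dirs.contains(path) }
  def delete(path: String): Boolean = { rpcCalls += 1; dirs.remove(path) }
}

// Hypothetical fragment: remember locally whether the staging dir was created.
final class StagingCleaner(fs: CountingFs, stagingDir: String) {
  private var stagingDirCreated = false

  def createStagingDir(): Unit = {
    stagingDirCreated = fs.mkdirs(stagingDir)
  }

  def cleanStagingDir(): Unit = {
    // No exists() call: the flag replaces the first RPC, and the delete
    // is skipped entirely when nothing was ever created.
    if (stagingDirCreated) {
      fs.delete(stagingDir)
      stagingDirCreated = false
    }
  }
}

object StagingFlagSketch {
  /** RPCs issued when cleanup runs but no staging dir was ever created. */
  def rpcsWithoutStaging(): Int = {
    val fs = new CountingFs
    new StagingCleaner(fs, "/tmp/.staging-demo").cleanStagingDir()
    fs.rpcCalls
  }

  /** RPCs issued for create followed by cleanup (one mkdirs, one delete). */
  def rpcsWithStaging(): Int = {
    val fs = new CountingFs
    val cleaner = new StagingCleaner(fs, "/tmp/.staging-demo")
    cleaner.createStagingDir()
    cleaner.cleanStagingDir()
    fs.rpcCalls
  }

  def main(args: Array[String]): Unit = {
    println(s"no staging dir: ${rpcsWithoutStaging()} RPCs")
    println(s"with staging dir: ${rpcsWithStaging()} RPCs")
  }
}
```

With the flag, the common case where no staging directory exists costs no RPC at all, compared with one `exists` call in the current patch and one `delete` call before it.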
   
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


