symious commented on pull request #35569:
URL: https://github.com/apache/spark/pull/35569#issuecomment-1049345319


   @steveloughran Thanks for the review.
   
   I do agree with the overhead incurred by the added RPC of check existence. 
So the latest version of commit's idea is
    
   - `commitJob`: delete the directory if we are sure that the directory 
exists, that is, ` if (dynamicPartitionOverwrite || filesToMove.nonEmpty)`, so 
we don't need to do the check when the job is running successfully
   - `abortJob`: in this case, we can not decide if we have created the 
directory, and the frequency of invoking `abortJob` shouldn't be too much, so 
the check existence before deleting was used.
   
   > Similarly, for users of FileSystem, maybe some FileSystems do check the 
existence before deleting like Alluxio, but, IMHO, we can't ask all the 
FileSystem to do the same, it's better to do the check on users' side.
   > 
   > why?
   
   In fact, this ticket was brought up by the warning log of Alluxio. Since we 
are migrating some jobs to Alluxio, when I saw the warning log, I'd assume 
there might be something wrong with Alluxio, as HDFS didn't give back the logs 
like this. Even after checking the code of Spark, I still thought the problem 
comes from Alluxio. After deeper debugging, I found Alluxio is reporting the 
correct thing, although HDFS does return a result of "false" back to users.
   
   I think the Boolean value return by `fs.delete` is a little blur, it can be 
returned when we try to delete a nonexistent file, or we are trying to delete a 
file out of our permission, or something else. Not even to mention we didn't 
even process the result of this delete.
   
   Maybe it's the no-warning from HDFS and not-handling the result of deletion 
that gives the idea that the directory does exist and Alluxio is reporting the 
incorrect thing. So I think the point has changed from overhead and warning log 
to the mislead the delete has brought us, the sole ` fs.delete(stagingDir, 
true)` gives the incorrect idea that the stagingDir is always there.
   
   I even think the `check existence` in `abortJob` is not that important if we 
add some comments like "we are not sure if the stagingDir exists, so we try our 
best to delete the stagingDir". Only in `commitJob`, we can decide if the 
stagingDir exists, so when the stagingDir is not there, we don't have to do the 
deletion, so users should know the stagingDir wasn't a must-generated directory 
here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to