HeartSaVioR opened a new pull request #25565: [SPARK-28025][SS][BRANCH-2.4] Fix 
FileContextBasedCheckpointFileManager leaking c…
URL: https://github.com/apache/spark/pull/25565
 
 
   ### What changes were proposed in this pull request?
   
   This PR fixes the leak of crc files from CheckpointFileManager when 
FileContextBasedCheckpointFileManager is being used.
   
   Spark hits a Hadoop bug, 
[HADOOP-16255](https://issues.apache.org/jira/browse/HADOOP-16255), which seems 
to be a long-standing issue.
   
   The cause is that there are two `renameInternal` methods:
   
   ```
   public void renameInternal(Path src, Path dst)
   public void renameInternal(final Path src, final Path dst, boolean overwrite)
   ```
   
   Both should be overridden to handle all cases, but ChecksumFs only overrides 
the method with 2 params. When the method with 3 params is called, 
FilterFs.renameInternal(...) is called instead, and it does the rename with 
RawLocalFs as the underlying filesystem, so the crc file is not handled.
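
The dispatch problem can be reproduced without Hadoop. The sketch below is a simplified model, not the Hadoop source: `SketchRawFs`, `SketchFilterFs` and `SketchChecksumFs` are hypothetical stand-ins for RawLocalFs, FilterFs and ChecksumFs, and the `.<name>.crc` naming is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the class hierarchy; not the real Hadoop classes.
abstract class SketchFs {
    abstract void renameInternal(String src, String dst);
    abstract void renameInternal(String src, String dst, boolean overwrite);
}

// Stand-in for the raw filesystem: records every rename it performs.
class SketchRawFs extends SketchFs {
    final List<String> renames = new ArrayList<>();

    @Override
    void renameInternal(String src, String dst) {
        renames.add(src + " -> " + dst);
    }

    @Override
    void renameInternal(String src, String dst, boolean overwrite) {
        renames.add(src + " -> " + dst);
    }
}

// Stand-in for the filter layer: forwards both overloads unchanged.
class SketchFilterFs extends SketchFs {
    protected final SketchFs inner;

    SketchFilterFs(SketchFs inner) {
        this.inner = inner;
    }

    @Override
    void renameInternal(String src, String dst) {
        inner.renameInternal(src, dst);
    }

    @Override
    void renameInternal(String src, String dst, boolean overwrite) {
        inner.renameInternal(src, dst, overwrite);
    }
}

// Stand-in for the checksum layer: overrides ONLY the two-argument overload.
class SketchChecksumFs extends SketchFilterFs {
    SketchChecksumFs(SketchFs inner) {
        super(inner);
    }

    // Assumed naming scheme for the checksum sidecar file.
    static String crcOf(String name) {
        return "." + name + ".crc";
    }

    @Override
    void renameInternal(String src, String dst) {
        inner.renameInternal(src, dst);
        inner.renameInternal(crcOf(src), crcOf(dst));
    }
}

class RenameOverloadDemo {
    public static void main(String[] args) {
        SketchRawFs raw = new SketchRawFs();
        SketchChecksumFs fs = new SketchChecksumFs(raw);
        fs.renameInternal("a.tmp", "a");       // data file and crc file are renamed
        fs.renameInternal("b.tmp", "b", true); // falls through to the filter layer: data file only
        raw.renames.forEach(System.out::println);
    }
}
```

In this model the three-argument call never reaches the checksum-aware override, so the sidecar of `b.tmp` is left where it was.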
   
   The bug is related to FileContext, so FileSystemBasedCheckpointFileManager 
is not affected.
   
   [SPARK-17475](https://issues.apache.org/jira/browse/SPARK-17475) worked 
around this bug, but 
[SPARK-23966](https://issues.apache.org/jira/browse/SPARK-23966) seems to have 
introduced a regression.
   
   This PR deletes the crc file on a "best-effort" basis when renaming, since 
a failure to delete the crc file is not critical enough to fail the task.
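
A minimal sketch of the best-effort idea, using `java.nio.file` on a local directory rather than the Hadoop `FileContext` API. It is not the code in this PR: `BestEffortCrcCleanup`, `crcPathOf` and `renameAndCleanCrc` are hypothetical names, and the `.<name>.crc` sidecar naming is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class BestEffortCrcCleanup {
    // Assumed sidecar location: ".<name>.crc" next to the file itself.
    static Path crcPathOf(Path file) {
        return file.resolveSibling("." + file.getFileName() + ".crc");
    }

    // Renames src to dst, then tries to delete the crc file left behind for src.
    // A failed rename still propagates; a failed crc delete is only logged.
    // Returns true only if a crc file existed and was deleted.
    static boolean renameAndCleanCrc(Path src, Path dst) throws IOException {
        Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
        try {
            return Files.deleteIfExists(crcPathOf(src));
        } catch (IOException | SecurityException e) {
            // Best effort: a leftover crc file is not worth failing the caller.
            System.err.println("Could not delete " + crcPathOf(src) + ": " + e);
            return false;
        }
    }
}
```

The rename and the cleanup are deliberately treated differently: only the rename is allowed to fail the caller.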
   
   ### Why are the changes needed?
   
   This PR prevents crc files from being left behind even after batches are 
purged. Too many files in the same directory often hurt performance, and each 
crc file occupies more space on disk than its own size, so the leaked files can 
take up a nontrivial amount of space when the number of batches goes up to 
100000+.
   
   ### Does this PR introduce any user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Some unit tests are modified to check for leaked crc files.
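
A leak check of this kind could look roughly like the following. This is an illustrative helper, not the test code from this PR: `CrcLeakCheck` and `findCrcFiles` are hypothetical names, and it assumes crc files are hidden files named `.<name>.crc` in the checkpoint directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CrcLeakCheck {
    // Returns the sorted names of all ".<name>.crc" entries directly under dir.
    static List<String> findCrcFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .map(p -> p.getFileName().toString())
                .filter(n -> n.startsWith(".") && n.endsWith(".crc"))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

A test would then assert that `findCrcFiles(checkpointDir)` is empty after the operation under test.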

