GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/13042
[SPARK-15263][Core] Make shuffle service dir cleanup faster by using `rm -rf`
## What changes were proposed in this pull request?
Jira: https://issues.apache.org/jira/browse/SPARK-15263
The current logic for directory cleanup is slow because it does a directory
listing, recurses over child directories, checks for symbolic links, deletes
leaf files, and finally deletes the dirs once they are empty. This involves
repeated back-and-forth switching between user space and kernel space.
Since most deployment backends are Unix systems, we could essentially just
do `rm -rf` so that the entire deletion logic runs in kernel space.
The current Java-based implementation in Spark is similar to what standard
libraries like Guava and Commons IO do (e.g.
http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/FileUtils.java?view=markup#l1540).
However, Guava removed this method in favour of shelling out to an operating
system command (as in this PR). See the `Deprecated` note in the older Guava
javadocs for details:
http://google.github.io/guava/releases/10.0.1/api/docs/com/google/common/io/Files.html#deleteRecursively(java.io.File)
Ideally, Java would provide such APIs so that users would not have to write
platform-specific code themselves. Also, it's not just about speed: handling
race conditions during filesystem deletions is tricky. I found this Java bug
in a similar context:
http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7148952
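The idea can be sketched roughly as follows. This is a minimal, hypothetical illustration rather than the actual patch: the class and method names (`RmRfSketch`, `deleteRecursively`) are stand-ins, and a real implementation would also need a pure-Java fallback for non-Unix platforms.

```java
import java.io.File;
import java.io.IOException;

public class RmRfSketch {
    // Hypothetical sketch: on Unix-like systems, shell out to `rm -rf` so the
    // whole recursive deletion happens in a single child process instead of
    // many JVM-level list/stat/delete round trips.
    public static void deleteRecursively(File dir) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("rm", "-rf", dir.getAbsolutePath())
                .inheritIO()
                .start();
        int rc = p.waitFor();
        if (rc != 0) {
            throw new IOException("rm -rf exited with code " + rc + " for " + dir);
        }
    }
}
```

Because `rm -rf` does not follow symbolic links into other directories, this also sidesteps the symlink checks the Java implementation has to perform explicitly.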
## How was this patch tested?
I am relying on the existing test cases to cover the method. Suggestions on
how to test it further are welcome.
## Performance gains
*Input setup*: Created a nested directory structure of depth 3, with each
entry having 50 sub-dirs; the input being cleaned up contained ~125k
directories in total. Ran both approaches (in isolation) six times and
averaged the numbers:
Native Java cleanup | `rm -rf` as a separate process
------------ | -------------
10.04 sec | 4.11 sec
This change made deletion 2.4 times faster for the given test input.
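For reference, the benchmark input described above can be reproduced with a short sketch like this (a hypothetical helper, not part of the patch). With a fanout of 50 and depth 3, the tree holds 50 + 50² + 50³ = 127,550 directories, which matches the ~125k figure above.

```java
import java.io.File;

public class NestedDirs {
    // Create a directory tree `depth` levels deep with `fanout` children per
    // node, returning the total number of directories created.
    public static int createTree(File root, int depth, int fanout) {
        if (depth == 0) return 0;
        int created = 0;
        for (int i = 0; i < fanout; i++) {
            File child = new File(root, "d" + i);
            if (!child.mkdirs()) throw new RuntimeException("mkdir failed: " + child);
            created += 1 + createTree(child, depth - 1, fanout);
        }
        return created;
    }
}
```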
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark delete_recursive
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13042.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13042
----
commit 32cc1e63fde168e71a6d392106f551e874889a22
Author: Tejas Patil <[email protected]>
Date: 2016-05-11T01:38:21Z
[SPARK-15263][Core] Make shuffle service dir cleanup faster by using `rm -rf`
----