GitHub user sarutak opened a pull request:
https://github.com/apache/spark/pull/1326
[SPARK-2390] Files in staging directory cannot be deleted and wastes the
space of HDFS
When running jobs in YARN cluster mode with the HistoryServer in use, the
files in the staging directory cannot be deleted.
The HistoryServer uses the directory where the event log is written, and
that directory is accessed through an instance of
org.apache.hadoop.fs.FileSystem created by FileSystem.get.
ApplicationMaster, in turn, has an instance named fs, which is also
created by FileSystem.get.
FileSystem.get returns the same cached instance when the URI passed to it
refers to the same file system and the method is called by the same user.
Because of this behavior, when the event log directory is on HDFS, fs in
ApplicationMaster and fileSystem in FileLogger are the same instance.
When ApplicationMaster shuts down, fileSystem.close is called in
FileLogger#stop, which is invoked indirectly by SparkContext#stop.
ApplicationMaster#cleanupStagingDir is also called, by a JVM shutdown hook,
and this method invokes fs.delete(stagingDirPath).
Because fs.delete in ApplicationMaster is called after fileSystem.close in
FileLogger, fs.delete fails and the files in the staging directory are not
deleted.
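
The sequence above can be illustrated with a minimal, self-contained Java
sketch. It does not use Hadoop: ToyFileSystem is a hypothetical stand-in
that models only the two behaviors described here (one cached instance per
file system and user, and operations failing after close). The URIs, the
user name, and the "Filesystem closed" message are assumptions made for
the illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

final class FsCacheDemo {

    /** Toy stand-in for a cached file system handle. NOT the Hadoop class. */
    static final class ToyFileSystem {
        // One shared instance per (scheme, authority, user), as described above.
        private static final Map<String, ToyFileSystem> CACHE = new HashMap<>();
        private boolean closed = false;

        /** Returns the cached instance; the path part of the URI is ignored. */
        static synchronized ToyFileSystem get(URI uri, String user) {
            String key = uri.getScheme() + "://" + uri.getAuthority() + "@" + user;
            return CACHE.computeIfAbsent(key, k -> new ToyFileSystem());
        }

        /** Closes this handle; every holder of the same instance is affected. */
        void close() {
            closed = true;
        }

        /** Pretends to delete a path; fails once the handle has been closed. */
        boolean delete(String path) throws IOException {
            if (closed) {
                throw new IOException("Filesystem closed");
            }
            return true;
        }
    }

    public static void main(String[] args) {
        // Hypothetical URIs: same file system and user, different directories.
        ToyFileSystem loggerFs = ToyFileSystem.get(
            URI.create("hdfs://namenode:8020/spark-events"), "alice");
        ToyFileSystem amFs = ToyFileSystem.get(
            URI.create("hdfs://namenode:8020/user/alice/.sparkStaging"), "alice");

        System.out.println("same instance: " + (loggerFs == amFs));

        // Plays the role of FileLogger#stop.
        loggerFs.close();

        // Plays the role of ApplicationMaster#cleanupStagingDir.
        try {
            amFs.delete("/user/alice/.sparkStaging/app");
            System.out.println("staging directory deleted");
        } catch (IOException e) {
            System.out.println("delete failed: " + e.getMessage());
        }
    }
}
```

In this model the second handle fails only because it is the same object
as the first one, which is why not closing the shared instance in the
logger lets the later delete go through.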
I think calling fileSystem.close in FileLogger is not needed.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sarutak/spark SPARK-2390
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1326.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1326
----
commit 10e1a88d112dcf1bbb4fbb2152de81fe403fae02
Author: Kousuke Saruta <[email protected]>
Date: 2014-07-08T00:29:21Z
Removed fileSystem.close from FileLogger.scala so that it does not prevent
any other FileSystem operation
----