GitHub user sarutak opened a pull request:

    https://github.com/apache/spark/pull/1326

    [SPARK-2390] Files in staging directory cannot be deleted and wastes the 
space of HDFS

    When running jobs in YARN cluster mode with the HistoryServer, the files 
in the staging directory cannot be deleted.
    The HistoryServer reads the directory where the event log is written, and 
that directory is accessed through an instance of o.a.h.f.FileSystem created 
with FileSystem.get.
    
    On the other hand, ApplicationMaster has an instance named fs, which is 
also created with FileSystem.get.
    
    FileSystem.get returns the same cached instance when the URI passed to it 
refers to the same file system and the method is called by the same user.
    Because of this behavior, when the event log directory is on HDFS, fs in 
ApplicationMaster and fileSystem in FileLogger are the same instance.
    When ApplicationMaster shuts down, fileSystem.close is called in 
FileLogger#stop, which is invoked indirectly by SparkContext#stop.
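
    The sharing described above can be sketched with a small self-contained 
model. This is not the Hadoop implementation: ToyFileSystem, its cache key, and 
the host name, port, paths and user name are all hypothetical stand-ins chosen 
for illustration. It only shows that two callers asking for the same file 
system as the same user get one object, so a close by one is seen by the other.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for o.a.h.fs.FileSystem; NOT the Hadoop implementation.
class ToyFileSystem {
    // Assumed cache key: scheme + authority + user, as described in the text.
    private static final Map<String, ToyFileSystem> CACHE = new HashMap<>();

    private boolean closed = false;

    // Returns the cached instance for (file system, user), creating it on first use.
    static synchronized ToyFileSystem get(URI uri, String user) {
        String key = uri.getScheme() + "://" + uri.getAuthority() + "@" + user;
        return CACHE.computeIfAbsent(key, k -> new ToyFileSystem());
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

class CacheSharingDemo {
    public static void main(String[] args) {
        // Hypothetical URIs: different paths, same file system, same user.
        ToyFileSystem loggerFs =
            ToyFileSystem.get(URI.create("hdfs://namenode:8020/spark-events"), "spark");
        ToyFileSystem amFs =
            ToyFileSystem.get(URI.create("hdfs://namenode:8020/user/spark/.sparkStaging"), "spark");

        System.out.println("same instance: " + (loggerFs == amFs));

        // Closing through one reference closes the object the other one holds.
        loggerFs.close();
        System.out.println("AM handle closed: " + amFs.isClosed());
    }
}
```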
    
    ApplicationMaster#cleanupStagingDir is also called, by a JVM shutdown 
hook. In this method, fs.delete(stagingDirPath) is invoked.
    Because fs.delete in ApplicationMaster is called after fileSystem.close in 
FileLogger, fs.delete fails and the files in the staging directory are not 
deleted.
    
    I think calling fileSystem.close in FileLogger is not needed.
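
    As a rough illustration of the shutdown ordering, the sketch below models 
one shared handle that is first (optionally) closed by the logger and then used 
by the cleanup step. SharedFs, ShutdownOrderDemo, the staging path and the 
exception message are hypothetical; none of this is Spark or Hadoop code, and 
it assumes only that operations on a closed handle fail with an IOException.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Toy shared handle standing in for the cached FileSystem instance.
class SharedFs {
    private final Set<String> paths = new HashSet<>();
    private boolean closed = false;

    void create(String path) throws IOException {
        ensureOpen();
        paths.add(path);
    }

    boolean delete(String path) throws IOException {
        ensureOpen();
        return paths.remove(path);
    }

    boolean exists(String path) {
        return paths.contains(path);
    }

    void close() {
        closed = true;
    }

    // Assumption of this model: any operation after close fails.
    private void ensureOpen() throws IOException {
        if (closed) {
            throw new IOException("file system is closed");
        }
    }
}

class ShutdownOrderDemo {
    // Returns true if the staging directory is gone after the shutdown sequence.
    static boolean runShutdown(boolean loggerClosesFs) {
        SharedFs fs = new SharedFs();
        String stagingDir = "/user/spark/.sparkStaging/app_0001"; // hypothetical path
        try {
            fs.create(stagingDir);
            // Step 1: the logger stops; before the patch it also closed the handle.
            if (loggerClosesFs) {
                fs.close();
            }
            // Step 2: the shutdown hook tries to remove the staging directory.
            fs.delete(stagingDir);
        } catch (IOException e) {
            System.out.println("cleanup failed: " + e.getMessage());
        }
        return !fs.exists(stagingDir);
    }

    public static void main(String[] args) {
        System.out.println("logger closes fs, staging removed: " + runShutdown(true));
        System.out.println("logger leaves fs open, staging removed: " + runShutdown(false));
    }
}
```

    In this model the staging directory survives only when the logger closes 
the shared handle before the cleanup step runs, which is the case the commit 
in this pull request is meant to avoid.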

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sarutak/spark SPARK-2390

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1326
    
----
commit 10e1a88d112dcf1bbb4fbb2152de81fe403fae02
Author: Kousuke Saruta <[email protected]>
Date:   2014-07-08T00:29:21Z

    Removed fileSystem.close from FileLogger.scala not to prevent any other 
FileSystem operation

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
