GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/618
SPARK-1676: (branch-0.9 fix) Cache Hadoop UGIs to prevent FileSystem leak
This is a follow-up patch to #607, which contains a discussion of the proper
solution for this problem. This PR targets branch-0.9 to provide a very
low-impact fix that lets users enable UGI caching if they run into this
problem, without affecting the default behavior.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark ugi
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #618
----
commit cca6a8a968d0073fcf918ed1e40a3719640d2b37
Author: Aaron Davidson <[email protected]>
Date: 2014-04-30T20:51:54Z
SPARK-1676 Cache Hadoop UGIs by default to prevent FileSystem leak
UserGroupInformation objects (UGIs) are used for Hadoop security. A relatively
recent PR (#29) makes Spark always use UGIs when executing tasks. Unfortunately,
this triggers HDFS-3545, which causes the FileSystem cache to continuously create
new FileSystems, as the UGIs look different even though they are logically
identical. This causes a memory leak, and sometimes a file descriptor leak, for
FileSystems (like S3N) which maintain open connections.
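For anyone who wants to see the cache behavior directly, here is a minimal,
self-contained sketch. It is illustrative only and not part of this patch; it
uses a local file:// URI (S3N would additionally need credentials configured):

```scala
import java.net.URI
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

object UgiFileSystemCacheDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val uri = new URI("file:///tmp")  // local FS so the demo runs without S3 credentials

    // Builds a *fresh* UGI for the same logical user on every call, the way a
    // per-task UGI does. FileSystem's cache key includes the UGI, and UGIs
    // compare by identity, so each call yields a brand-new FileSystem instance.
    def openFs(): FileSystem = {
      val ugi = UserGroupInformation.createRemoteUser("spark")
      ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
        override def run(): FileSystem = FileSystem.get(uri, conf)
      })
    }

    val fs1 = openFs()
    val fs2 = openFs()
    // Distinct instances pile up in the cache; for connection-holding
    // FileSystems like S3N, each leaked instance also pins open connections.
    println(fs1 eq fs2)  // prints false
  }
}
```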
The solution is to introduce a config option (enabled by default) which reuses a
single Spark user UGI, rather than creating new ones for each task. The downside
to this approach is that UGIs cannot be safely cached (see the notes in
HDFS-3545). For example, if a token expires, it will never be cleared from the
UGI but may be used anyway (usage of a particular token on a UGI is
nondeterministic, as it is backed by a Set).
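To make the idea concrete, a rough sketch of "reuse one UGI behind a flag"
follows. The class name, flag wiring, and user lookup are hypothetical
stand-ins, not the actual change in this PR:

```scala
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: CachedUgiProvider and its constructor flag are placeholders.
class CachedUgiProvider(cacheUgis: Boolean) {
  // Created at most once and shared by every task when caching is enabled,
  // so the FileSystem cache sees a single UGI key.
  private lazy val cachedUgi: UserGroupInformation =
    UserGroupInformation.createRemoteUser(System.getProperty("user.name", "spark"))

  def ugiForTask(): UserGroupInformation =
    if (cacheUgis) cachedUgi
    else UserGroupInformation.createRemoteUser(
      System.getProperty("user.name", "spark"))  // old behavior: fresh UGI per task
}
```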
This setting is enabled by default because the memory leak can become serious
very quickly. In one benchmark, attempting to read 10k files from an S3 directory
caused 45k connections to remain open to S3 after the job completed. These file
descriptors are never cleaned up, nor is the memory used by the associated
FileSystems.
Conflicts:
docs/configuration.md
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
commit c8f29bca37b42aa639d679e1de76a86bcf00d12e
Author: Aaron Davidson <[email protected]>
Date: 2014-05-02T03:51:59Z
Set to false by default and point towards branch-0.9
----