GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/607
SPARK-1676: Cache Hadoop UGIs by default to prevent FileSystem leak
UserGroupInformation objects (UGIs) are used for Hadoop security. A
relatively recent PR (#29) makes Spark always use UGIs when executing tasks.
Unfortunately, this triggers
[HDFS-3545](https://issues.apache.org/jira/browse/HDFS-3545): because each new
UGI looks different from the last (even though they're logically identical),
the FileSystem cache misses every time and continuously creates new
FileSystems. The result is a memory leak, and sometimes a file descriptor
leak, for FileSystems (like S3N) that maintain open connections.
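To make the cache-miss mechanics concrete, here is a minimal sketch (not code from this PR) of the per-task pattern that provokes the leak. It uses Hadoop's standard `UserGroupInformation` and `FileSystem` APIs; the object name and the user name "sparkuser" are placeholders:

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

object UgiCacheLeakDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Simulate what happens once per task after PR #29: a fresh UGI is
    // created for the same logical user, and the task body runs in doAs.
    for (i <- 1 to 3) {
      val ugi = UserGroupInformation.createRemoteUser("sparkuser")
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = {
          // The FileSystem cache keys on (scheme, authority, UGI), and UGI
          // equality is reference-based, so each fresh UGI misses the cache
          // and a brand-new FileSystem (with its own connections) is built.
          val fs = FileSystem.get(conf)
          println(s"task $i got FileSystem ${System.identityHashCode(fs)}")
        }
      })
    }
  }
}
```

Each iteration prints a different identity hash, showing that a new FileSystem instance was constructed per task even though the user never changed.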
The solution is to introduce a config option (enabled by default) that
reuses a single Spark user UGI, rather than creating a new one for each task.
The downside to this approach is that UGIs cannot be safely cached (see the
notes in [HDFS-3545](https://issues.apache.org/jira/browse/HDFS-3545)). For
example, if a token expires, it is never cleared from the UGI but may still be
used (which token a UGI picks is nondeterministic, as its tokens are backed by
a Set).
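As an illustration of the approach (a hypothetical sketch, not the actual patch: the `cacheUgi` flag, class name, and user-name lookup here are all invented for the example), the change amounts to gating UGI creation on a config option and reusing one lazily created instance:

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation

// Hypothetical sketch: "cacheUgi" stands in for the real config option.
class SparkUserRunner(cacheUgi: Boolean) {
  private def sparkUser: String =
    Option(System.getenv("SPARK_USER")).getOrElse(System.getProperty("user.name"))

  // Created once and reused for every task when caching is enabled, so the
  // FileSystem cache sees the same key each time.
  private lazy val cachedUgi: UserGroupInformation =
    UserGroupInformation.createRemoteUser(sparkUser)

  def runAsSparkUser(func: () => Unit): Unit = {
    val ugi =
      if (cacheUgi) cachedUgi // one UGI => at most one FileSystem per scheme
      else UserGroupInformation.createRemoteUser(sparkUser) // fresh per task
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = func()
    })
  }
}
```

Reusing one UGI trades the leak for the stale-token risk described above, which is why the behavior is made configurable rather than unconditional.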
This setting is enabled by default because the memory leak can become
serious very quickly. In one benchmark, attempting to read 10k files from an S3
directory left 45k connections open to S3 after the job completed. Neither
those file descriptors nor the memory held by the associated FileSystems is
ever reclaimed.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark ugi
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/607.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #607
----
commit e9009e046e496cab6c1a226756eca09b545c7179
Author: Aaron Davidson <[email protected]>
Date: 2014-04-30T20:51:54Z
SPARK-1676 Cache Hadoop UGIs by default to prevent FileSystem leak
UserGroupInformation objects (UGIs) are used for Hadoop security. A relatively
recent PR (#29) makes Spark always use UGIs when executing tasks.
Unfortunately, this triggers HDFS-3545: because each new UGI looks different
from the last (even though they're logically identical), the FileSystem cache
misses every time and continuously creates new FileSystems. The result is a
memory leak, and sometimes a file descriptor leak, for FileSystems (like S3N)
that maintain open connections.

The solution is to introduce a config option (enabled by default) that reuses
a single Spark user UGI, rather than creating a new one for each task. The
downside to this approach is that UGIs cannot be safely cached (see the notes
in HDFS-3545). For example, if a token expires, it is never cleared from the
UGI but may still be used (which token a UGI picks is nondeterministic, as its
tokens are backed by a Set).

This setting is enabled by default because the memory leak can become serious
very quickly. In one benchmark, attempting to read 10k files from an S3
directory left 45k connections open to S3 after the job completed. Neither
those file descriptors nor the memory held by the associated FileSystems is
ever reclaimed.
----