GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/607
SPARK-1676: Cache Hadoop UGIs by default to prevent FileSystem leak
UserGroupInformation objects (UGIs) are used for Hadoop security. A
relatively recent PR (#29) makes Spark always use UGIs when executing tasks.
Unfortunately, this triggers
[HDFS-3545](https://issues.apache.org/jira/browse/HDFS-3545): because each new
UGI looks different from the last (even though they're logically identical),
the FileSystem cache misses every time and continuously creates new
FileSystems. The result is a memory leak, and sometimes a file descriptor
leak, for FileSystems (like S3N) that maintain open connections.
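To make the cache-miss mechanics concrete, here is a minimal sketch (not code from this PR) of the per-task pattern that provokes the leak. It uses Hadoop's standard `UserGroupInformation` and `FileSystem` APIs; the object name and the user name "sparkuser" are placeholders:

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

object UgiCacheLeakDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Simulate what happens once per task after PR #29: a fresh UGI is
    // created for the same logical user, and the task body runs in doAs.
    for (i <- 1 to 3) {
      val ugi = UserGroupInformation.createRemoteUser("sparkuser")
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = {
          // The FileSystem cache keys on (scheme, authority, UGI), and UGI
          // equality is reference-based, so each fresh UGI misses the cache
          // and a brand-new FileSystem (with its own connections) is built.
          val fs = FileSystem.get(conf)
          println(s"task $i got FileSystem ${System.identityHashCode(fs)}")
        }
      })
    }
  }
}
```

Each iteration prints a different identity hash, showing that a new FileSystem instance was constructed per task even though the user never changed.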
The solution is to introduce a config option (enabled by default) that
reuses a single Spark user UGI, rather than creating a new one for each task.
The downside to this approach is that UGIs cannot be safely cached (see the
notes in [HDFS-3545](https://issues.apache.org/jira/browse/HDFS-3545)). For
example, if a token expires, it is never cleared from the UGI but may still be
used (which token a UGI picks is nondeterministic, as its tokens are backed by
a Set).
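As an illustration of the approach (a hypothetical sketch, not the actual patch: the `cacheUgi` flag, class name, and user-name lookup here are all invented for the example), the change amounts to gating UGI creation on a config option and reusing one lazily created instance:

```scala
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.security.UserGroupInformation

// Hypothetical sketch: "cacheUgi" stands in for the real config option.
class SparkUserRunner(cacheUgi: Boolean) {
  private def sparkUser: String =
    Option(System.getenv("SPARK_USER")).getOrElse(System.getProperty("user.name"))

  // Created once and reused for every task when caching is enabled, so the
  // FileSystem cache sees the same key each time.
  private lazy val cachedUgi: UserGroupInformation =
    UserGroupInformation.createRemoteUser(sparkUser)

  def runAsSparkUser(func: () => Unit): Unit = {
    val ugi =
      if (cacheUgi) cachedUgi // one UGI => at most one FileSystem per scheme
      else UserGroupInformation.createRemoteUser(sparkUser) // fresh per task
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = func()
    })
  }
}
```

Reusing one UGI trades the leak for the stale-token risk described above, which is why the behavior is made configurable rather than unconditional.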
This setting is enabled by default because the memory leak can become
serious very quickly. In one benchmark, attempting to read 10k files from an S3
directory left 45k connections open to S3 after the job completed. Neither
those file descriptors nor the memory held by the associated FileSystems is
ever reclaimed.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark ugi
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/607.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #607
----
commit e9009e046e496cab6c1a226756eca09b545c7179
Author: Aaron Davidson <[email protected]>
Date: 2014-04-30T20:51:54Z
SPARK-1676 Cache Hadoop UGIs by default to prevent FileSystem leak
UserGroupInformation objects (UGIs) are used for Hadoop security. A relatively
recent PR (#29) makes Spark always use UGIs when executing tasks.
Unfortunately, this triggers HDFS-3545: because each new UGI looks different
from the last (even though they're logically identical), the FileSystem cache
misses every time and continuously creates new FileSystems. The result is a
memory leak, and sometimes a file descriptor leak, for FileSystems (like S3N)
that maintain open connections.

The solution is to introduce a config option (enabled by default) that reuses
a single Spark user UGI, rather than creating a new one for each task. The
downside to this approach is that UGIs cannot be safely cached (see the notes
in HDFS-3545). For example, if a token expires, it is never cleared from the
UGI but may still be used (which token a UGI picks is nondeterministic, as its
tokens are backed by a Set).

This setting is enabled by default because the memory leak can become serious
very quickly. In one benchmark, attempting to read 10k files from an S3
directory left 45k connections open to S3 after the job completed. Neither
those file descriptors nor the memory held by the associated FileSystems is
ever reclaimed.
----