GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/618
SPARK-1676: (branch-0.9 fix) Cache Hadoop UGIs to prevent FileSystem leak
This is a follow-up patch to #607, which contains a discussion of the proper
solution for this problem. This PR targets branch-0.9 to provide a very
low-impact fix that lets users enable UGI caching if they run into this
problem, without affecting the default behavior.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark ugi
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/618.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #618
----
commit cca6a8a968d0073fcf918ed1e40a3719640d2b37
Author: Aaron Davidson <[email protected]>
Date: 2014-04-30T20:51:54Z
SPARK-1676 Cache Hadoop UGIs by default to prevent FileSystem leak
UserGroupInformation objects (UGIs) are used for Hadoop security. A relatively
recent PR (#29) makes Spark always use UGIs when executing tasks. Unfortunately,
this triggers HDFS-3545, which causes the FileSystem cache to continuously create
new FileSystems, as the UGIs look different even though they are logically
identical. This causes a memory leak, and sometimes a file descriptor leak, for
FileSystems (like S3N) which maintain open connections.
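For anyone who wants to see the cache behavior directly, here is a minimal,
self-contained sketch. It is illustrative only and not part of this patch; it
uses a local file:// URI (S3N would additionally need credentials configured):

```scala
import java.net.URI
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

object UgiFileSystemCacheDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val uri = new URI("file:///tmp")  // local FS so the demo runs without S3 credentials

    // Builds a *fresh* UGI for the same logical user on every call, the way a
    // per-task UGI does. FileSystem's cache key includes the UGI, and UGIs
    // compare by identity, so each call yields a brand-new FileSystem instance.
    def openFs(): FileSystem = {
      val ugi = UserGroupInformation.createRemoteUser("spark")
      ugi.doAs(new PrivilegedExceptionAction[FileSystem] {
        override def run(): FileSystem = FileSystem.get(uri, conf)
      })
    }

    val fs1 = openFs()
    val fs2 = openFs()
    // Distinct instances pile up in the cache; for connection-holding
    // FileSystems like S3N, each leaked instance also pins open connections.
    println(fs1 eq fs2)  // prints false
  }
}
```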
The solution is to introduce a config option (enabled by default) which reuses a
single Spark user UGI, rather than creating new ones for each task. The downside
to this approach is that UGIs cannot be safely cached (see the notes in
HDFS-3545). For example, if a token expires, it will never be cleared from the
UGI but may be used anyway (usage of a particular token on a UGI is
nondeterministic, as it is backed by a Set).
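To make the idea concrete, a rough sketch of "reuse one UGI behind a flag"
follows. The class name, flag wiring, and user lookup are hypothetical
stand-ins, not the actual change in this PR:

```scala
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: CachedUgiProvider and its constructor flag are placeholders.
class CachedUgiProvider(cacheUgis: Boolean) {
  // Created at most once and shared by every task when caching is enabled,
  // so the FileSystem cache sees a single UGI key.
  private lazy val cachedUgi: UserGroupInformation =
    UserGroupInformation.createRemoteUser(System.getProperty("user.name", "spark"))

  def ugiForTask(): UserGroupInformation =
    if (cacheUgis) cachedUgi
    else UserGroupInformation.createRemoteUser(
      System.getProperty("user.name", "spark"))  // old behavior: fresh UGI per task
}
```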
This setting is enabled by default because the memory leak can become serious
very quickly. In one benchmark, attempting to read 10k files from an S3 directory
caused 45k connections to remain open to S3 after the job completed. These file
descriptors are never cleaned up, nor is the memory used by the associated
FileSystems.
Conflicts:
docs/configuration.md
yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
commit c8f29bca37b42aa639d679e1de76a86bcf00d12e
Author: Aaron Davidson <[email protected]>
Date: 2014-05-02T03:51:59Z
Set to false by default and point towards branch-0.9
----