GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/4106
[SPARK-5158] [core] [security] Spark standalone mode can authenticate
against a Kerberos-secured Hadoop cluster
Previously, Kerberos-secured Hadoop clusters could only be accessed by
Spark running on top of YARN. In other words, Spark standalone clusters had no
way to read from secure Hadoop clusters. Other solutions were proposed
previously, but all of them attempted to perform authentication by obtaining
a token on a single node and passing that token around to all of the other
Spark worker nodes. Shipping the token is risky, and all previous iterations
fell short, leaving the token open to man-in-the-middle attacks.
This patch introduces an alternative approach. It assumes that the keytab
file has already been distributed to every node in the cluster. When Spark
starts in standalone mode, all of the workers individually log in via Kerberos
using the principal and keytab file specified in hdfs-site.xml. We can assume
this configuration is well-formed because, in standalone deployments, all of
the worker nodes should be using the same hdfs-site.xml as the Hadoop cluster
itself. In addition, in basic Hadoop cluster setups the keytab file is often
already manually deployed on all of the cluster's nodes; it is not a huge
stretch to expect the keytab files to be deployed to the Spark worker nodes as
well, if they are not already there.
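For illustration, here is a minimal sketch (in Scala, against Hadoop's
UserGroupInformation API) of the per-worker login described above. The
configuration key names are placeholders only, not necessarily the keys this
patch actually reads:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    def loginWorkerFromKeytab(): Unit = {
      val conf = new Configuration()
      // Make sure the HDFS settings from hdfs-site.xml are loaded.
      conf.addResource("hdfs-site.xml")
      // Placeholder keys; the real keys depend on the cluster's configuration.
      val principal = conf.get("dfs.datanode.kerberos.principal")
      val keytab = conf.get("dfs.datanode.keytab.file")
      require(principal != null && keytab != null,
        "Kerberos principal/keytab not found in the Hadoop configuration")
      // Tell the Hadoop libraries that authentication is Kerberos.
      UserGroupInformation.setConfiguration(conf)
      // Log the whole process in once, at worker startup.
      UserGroupInformation.loginUserFromKeytab(principal, keytab)
    }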
There are a number of caveats to this approach. Firstly, it assumes that
Spark will always authenticate with Kerberos using the same principal and
keytab, and that the login is done at the very start of the job. Strictly
speaking, we should be trying to reduce the surface area of the region of code
that operates under a logged-in state. To put it another way, authentication
should be performed precisely when files are read from or written to HDFS,
and after the read or write completes the subject should be logged out.
However, this is difficult to implement correctly and prone to errors, so it
is left for a future refactor.
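As a rough sketch of what that narrower scope could look like (illustrative
Scala only, not part of this patch), one could log in with
loginUserFromKeytabAndReturnUGI and confine each HDFS operation to a doAs
block; note UGI offers no explicit logout, so the credentials simply fall out
of use afterwards:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.security.UserGroupInformation

    def withHdfsAsUser[T](principal: String, keytab: String, conf: Configuration)
                         (body: FileSystem => T): T = {
      // Returns an isolated UGI rather than mutating the process-wide login user.
      val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
      // Run a single HDFS operation as that user, then let the UGI go out of scope.
      ugi.doAs(new PrivilegedExceptionAction[T] {
        override def run(): T = body(FileSystem.get(conf))
      })
    }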
More concerning, the code does not actually execute "kinit": each of the
executor nodes needs to run kinit manually before starting the job. It is
suggested that a call to kinit with the appropriate principal and keytab be
made in spark-env.sh, and we should document this as being the case. I note
that UserGroupInformation.loginUserFromKeytab(...) does not actually run kinit
but merely creates a "delegation token"; doing this without running kinit
first still makes the Spark job crash with an exception message along the
lines of "Unable to find tgt...". Any suggestions on how to actually execute
"kinit" from Java are appreciated; a system call is flaky at best due to
cross-platform issues.
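Relatedly, the follow-up commit in this PR makes misconfigured security
settings on standalone mode fail fast. Purely as a hedged sketch (in Scala;
the method name is hypothetical, not the patch's actual code), such a
pre-flight check might look like:

    import java.io.File
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    def failFastOnBadSecurityConfig(conf: Configuration,
                                    principal: String,
                                    keytab: String): Unit = {
      UserGroupInformation.setConfiguration(conf)
      // Only validate when hadoop.security.authentication=kerberos.
      if (UserGroupInformation.isSecurityEnabled) {
        require(principal != null && principal.nonEmpty,
          "Kerberos principal is not set")
        require(new File(keytab).isFile, s"Keytab file not found: $keytab")
      }
    }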
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mccheah/spark hadoop-kerberos
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4106.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4106
----
commit be41dac005442074ae6c84d33d5661c2a59134e9
Author: mccheah <[email protected]>
Date: 2015-01-14T02:58:26Z
[SPARK-5158] Spark standalone mode can authenticate against a
Kerberos-secured Hadoop cluster
commit 5a7bd66cf11e3ec2e72f092211bdf2bc641424fb
Author: mcheah <[email protected]>
Date: 2015-01-19T20:08:15Z
Making misconfigured Hadoop security settings on standalone mode fail-fast.
----