GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/4106

    [SPARK-5158] [core] [security] Spark standalone mode can authenticate against a Kerberos-secured Hadoop cluster

    Previously, Kerberos-secured Hadoop clusters could only be accessed by Spark running on top of YARN; in other words, Spark standalone clusters had no way to read from secure Hadoop clusters. Other solutions have been proposed previously, but all of them attempted to perform authentication by obtaining a token on a single node and passing that token around to all of the other Spark worker nodes. Shipping the token is risky, and all previous iterations fell short because they left the token open to man-in-the-middle attacks.
    
    This patch introduces an alternative approach. It assumes that the keytab file has already been distributed to every node in the cluster. When Spark starts in standalone mode, each worker individually logs in via Kerberos using the principal and keytab file specified in hdfs-site.xml. We can assume this configuration is well-formed because, in standalone deployments, all of the worker nodes should be using the same hdfs-site.xml configuration as the Hadoop cluster itself. In addition, on basic Hadoop cluster setups the keytab file is often already manually deployed on all of the cluster's nodes; it is not a huge stretch to expect the keytab files to be deployed to the Spark worker nodes as well, if they are not already there.
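
    For illustration, the login each worker performs amounts to something like the sketch below. The configuration key names shown are placeholders standing in for whatever principal and keytab entries a given cluster's hdfs-site.xml carries; they are not necessarily the exact keys this patch reads:

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.security.UserGroupInformation

        // Log this worker in from the keytab referenced by the local Hadoop
        // configuration. The key names below are illustrative placeholders.
        def loginFromHadoopConf(hadoopConf: Configuration): Unit = {
          val principal = hadoopConf.get("dfs.datanode.kerberos.principal")
          val keytab = hadoopConf.get("dfs.datanode.keytab.file")
          if (principal != null && keytab != null) {
            UserGroupInformation.setConfiguration(hadoopConf)
            UserGroupInformation.loginUserFromKeytab(principal, keytab)
          }
        }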
    
    There are a number of caveats to this approach. Firstly, it assumes that Spark will always authenticate with Kerberos using the same principal and keytab, and that the login is done at the very start of the job. Strictly speaking, we should be trying to reduce the surface area of the region of code that operates under a logged-in state. To put it another way, the authentication should be performed precisely when files are written to or read from HDFS, and after the read or write completes the subject should be logged out. However, this is difficult to write and prone to errors, so it is left for a future refactor.
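
    To make that alternative concrete, a rough sketch of scoping the login to a single HDFS operation might look like the following (getFileStatus here is only a stand-in for any read; the error-prone part is doing this correctly at every read/write site):

        import java.security.PrivilegedExceptionAction
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.security.UserGroupInformation

        // Log in, perform one HDFS operation under that identity, and let the
        // credentials fall out of scope instead of keeping a process-wide login.
        def readUnderKerberos(principal: String, keytab: String, file: String): Long = {
          val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
          ugi.doAs(new PrivilegedExceptionAction[Long] {
            override def run(): Long = {
              val fs = FileSystem.get(new Configuration())
              fs.getFileStatus(new Path(file)).getLen
            }
          })
        }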
    
    More concerningly, the code does not actually execute "kinit", so each of the executor nodes needs to run kinit manually before starting the job. It is suggested that a call to kinit with the appropriate principal and keytab be made in spark-env.sh, and we should document this as being the case. I remark that UserGroupInformation.loginUserFromKeytab(...) does not actually run kinit, but merely creates a "delegation token"; doing this without running kinit still makes the Spark job crash with an exception message along the lines of "Unable to find tgt...". Any suggestions as to how to actually execute "kinit" in Java are appreciated; a system call is flaky at best due to cross-platform issues.
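
    For reference, the system-call approach mentioned above would look roughly like the sketch below; it only works where a kinit binary is installed and on the PATH, which is exactly the cross-platform flakiness in question:

        import scala.sys.process._

        // Shell out to kinit with the given keytab and principal. This depends
        // on a platform-specific kinit binary being installed and on the PATH,
        // and returns the process exit code.
        def runKinit(principal: String, keytab: String): Int = {
          Seq("kinit", "-kt", keytab, principal).!
        }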

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark hadoop-kerberos

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4106
    
----
commit be41dac005442074ae6c84d33d5661c2a59134e9
Author: mccheah <[email protected]>
Date:   2015-01-14T02:58:26Z

    [SPARK-5158] Spark standalone mode can authenticate against a Kerberos-secured Hadoop cluster
    

commit 5a7bd66cf11e3ec2e72f092211bdf2bc641424fb
Author: mcheah <[email protected]>
Date:   2015-01-19T20:08:15Z

    Making misconfigured Hadoop security settings on standalone mode fail-fast.

----

