GitHub user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/11033#discussion_r53528281
--- Diff: docs/running-on-yarn.md ---
@@ -441,3 +441,81 @@ If you need a reference to the proper location to put log files in the YARN so t
- In `cluster` mode, the local directories used by the Spark executors and
the Spark driver will be the local directories configured for YARN (Hadoop YARN
config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`,
it will be ignored. In `client` mode, the Spark executors will use the local
directories configured for YARN while the Spark driver will use those defined
in `spark.local.dir`. This is because the Spark driver does not run on the YARN
cluster in `client` mode, only the Spark executors do.
- The `--files` and `--archives` options support specifying file names with the `#`
separator, similar to Hadoop. For example, you can specify `--files localtest.txt#appSees.txt`;
this uploads the file locally named `localtest.txt` into HDFS, where it is linked to by the
name `appSees.txt`, and your application should use the name `appSees.txt` to reference it
when running on YARN (see the example after this list).
- The `--jars` option allows the `SparkContext.addJar` function to work when you are using
it with local files and running in `cluster` mode. It is not needed for HDFS, HTTP, HTTPS,
or FTP files (also shown in the example after this list).
+
+# Running in a secure YARN cluster
+
+As covered in [security](security.html), Kerberos is used in a secure YARN cluster to
+authenticate principals with services, and so to access those services and their data.
+These services issue 'hadoop tokens' to grant access to the service and its data; the
+tokens are then supplied over Hadoop IPC and REST/Web APIs as proof of access rights.
+For YARN applications to interact with HDFS, HBase and Hive, the application must request
+tokens using the Kerberos credentials of the user launching the application (the principal).
+
+This is normally done at launch time: Spark will automatically obtain a token for the
+cluster's HDFS filesystem, and optionally for HBase and Hive.
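If an application does not need Hive or HBase access, collection of those tokens can be skipped. A sketch, assuming the `spark.yarn.security.tokens.{service}.enabled` properties of this era of Spark on YARN; check your version's configuration page:

```
# Assumption: these property names match the Spark version in use.
# Skip fetching Hive and HBase tokens when only HDFS access is needed.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.security.tokens.hive.enabled=false \
  --conf spark.yarn.security.tokens.hbase.enabled=false \
  --class org.example.MyApp \
  my-app.jar
```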
+
+If an application needs to interact with other secure HDFS clusters, then the tokens
+needed to access these clusters must be explicitly requested at launch time. This is done
+by listing them in the `spark.yarn.access.namenodes` property.
+
+```
+spark.yarn.access.namenodes hdfs://ireland.emea.example.org:8020/,hdfs://frankfurt.emea.example.org:8020/
+```
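The same property can also be supplied at launch time with `--conf`; a sketch reusing the namenodes above (the class and jar names are hypothetical):

```
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.access.namenodes=hdfs://ireland.emea.example.org:8020/,hdfs://frankfurt.emea.example.org:8020/ \
  --class org.example.MyApp \
  my-app.jar
```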
+
+Hadoop tokens expire. They can be renewed for a limited period, and the Spark Application
+Master will do this automatically. Eventually, however, they stop being renewable, after
+which all attempts to access secure data will fail. The only way to avoid that is for the
+application to be launched with the secrets needed to log in to Kerberos directly: a keytab.
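A sketch of such a launch using spark-submit's `--principal` and `--keytab` options; the principal, keytab path, class, and jar names are hypothetical:

```
# With a keytab, the Application Master can log in to Kerberos and
# re-acquire tokens once the current ones are no longer renewable.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal [email protected] \
  --keytab /etc/security/keytabs/alice.keytab \
  --class org.example.MyApp \
  my-app.jar
```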
+
+## Launching your application with Apache Oozie
+
+Apache Oozie can launch Spark applications as part of a workflow. In a secure cluster, the
+launched application will need the relevant tokens to access the cluster's services. If
+Spark is launched with a keytab, this is automatic. However, if Spark is to be launched
+without a keytab, the responsibility for setting up security must be handed over to Oozie.
+
+The Oozie workflow configuration must be set up for Oozie to request all tokens, including:
--- End diff --
Referring to the Oozie docs will probably be the best-maintained option. Which URL would you recommend?