Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11033#discussion_r55577866
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -441,3 +441,91 @@ If you need a reference to the proper location to put 
log files in the YARN so t
     - In `cluster` mode, the local directories used by the Spark executors and 
the Spark driver will be the local directories configured for YARN (Hadoop YARN 
config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, 
it will be ignored. In `client` mode, the Spark executors will use the local 
directories configured for YARN while the Spark driver will use those defined 
in `spark.local.dir`. This is because the Spark driver does not run on the YARN 
cluster in `client` mode, only the Spark executors do.
     - The `--files` and `--archives` options support specifying file names 
with a `#`, similar to Hadoop. For example, you can specify `--files 
localtest.txt#appSees.txt`: this uploads the file you have locally named 
`localtest.txt` into HDFS, but it will be linked to by the name `appSees.txt`, 
and your application should use the name `appSees.txt` to reference it when 
running on YARN.
     - The `--jars` option allows the `SparkContext.addJar` function to work if 
you are using it with local files and running in `cluster` mode. It does not 
need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
    +
    +# Running in a secure YARN cluster
    +
    +As covered in [security](security.html), Kerberos is used in a secure YARN 
cluster to
    +authenticate principals associated with services and clients. This allows 
clients to
    +make requests of these authenticated services, and the services to grant
    +rights to the authenticated principals.
    +
    +Hadoop services issue *hadoop tokens* to grant access to the services and
    +data; the client must supply these tokens over Hadoop IPC and REST/Web APIs
    +as proof of its access rights.
    +For Spark applications launched in a YARN cluster to interact with HDFS, 
HBase and Hive,
    +the application must acquire the relevant tokens
    +using the Kerberos credentials of the user launching the application, that
    +is, the principal whose identity will become that of the launched Spark
    +application.
    +
    +This is normally done at launch time: in a secure cluster Spark will
    +automatically obtain a token for the cluster's HDFS filesystem and, if
    +required, for HBase and Hive.
    +
    +If an application needs to interact with other secure HDFS clusters, then
    +the tokens needed to access these clusters must be explicitly requested at
    +launch time. This is done by listing them in the 
`spark.yarn.access.namenodes` property.
    +
    +```
    +spark.yarn.access.namenodes hdfs://ireland.emea.example.org:8020/,hdfs://frankfurt.emea.example.org:8020/
    +```
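    +
    +The same property can also be supplied at submission time with `--conf`.
    +For example (a sketch; the application jar name is hypothetical):
    +
    +```
    +spark-submit \
    +  --master yarn \
    +  --conf spark.yarn.access.namenodes=hdfs://ireland.emea.example.org:8020/,hdfs://frankfurt.emea.example.org:8020/ \
    +  myapp.jar
    +```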
    +
    +Hadoop tokens expire. They can be renewed for a while, but eventually they
    +stop being renewable, after which all attempts to access secure data will
    +fail. The only way to avoid that is to launch the application with the
    +secrets needed to log in to Kerberos directly: a "keytab". Consult the
    +[Spark property](#spark-properties) `spark.yarn.keytab` for the specifics.
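    +
    +For example, a long-running application can be submitted with a principal
    +and keytab (a sketch; the principal name, keytab path and application jar
    +are hypothetical):
    +
    +```
    +spark-submit \
    +  --master yarn \
    +  --principal [email protected] \
    +  --keytab /etc/security/keytabs/alice.keytab \
    +  myapp.jar
    +```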
    +
    +## Launching your application with Apache Oozie
    +
    +Apache Oozie can launch Spark applications as part of a workflow.
    +In a secure cluster, such an application will need the relevant tokens to 
access the cluster's
    +services. If Spark is launched with a keytab, this is automatic.
    +However, if Spark is to be launched without a keytab, the responsibility 
for setting up security
    +must be handed over to Oozie.
    +
    +The details of configuring Oozie for secure clusters and obtaining
    +credentials for a job can be found on the [Oozie web 
site](http://oozie.apache.org/)
    +in the "Authentication" section of the specific release's documentation.
    + 
    +For Spark applications, the Oozie workflow must be set up for Oozie to 
request all tokens, including:
    --- End diff --
    
    Sorry, not sure I follow. I don't need to tell Oozie to get Hive and HBase 
tokens if I'm not going to use Hive or HBase in my Spark job. So all I'm saying 
is: perhaps, like the YARN timeline server entry below, add something that says 
this applies only if your application uses them.

