vanzin commented on a change in pull request #23348: [SPARK-25857][core] Add 
developer documentation regarding delegation tokens.
URL: https://github.com/apache/spark/pull/23348#discussion_r245743329
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/security/README.md
 ##########
 @@ -0,0 +1,238 @@
+# Delegation Token Handling In Spark
+
+This document aims to explain and demystify delegation tokens as they are used 
by Spark, since
+this topic is generally a huge source of confusion.
+
+
+## What are delegation tokens?
+
+Delegation tokens (DTs from now on) are authentication tokens used by some 
services to replace
+Kerberos service tokens. Many services in the Hadoop ecosystem have support 
for DTs, since they
+have two very desirable advantages over Kerberos tokens:
+
+* No need to distribute Kerberos credentials
+
+In a distributed application, distributing Kerberos credentials is tricky. Not 
all users have
+keytabs, and when they do, it's generally frowned upon to distribute them over 
the network as
+part of application data.
+
+DTs allow for a single place (e.g. the Spark driver) to require Kerberos 
credentials. That entity
+can then distribute the DTs to other parts of the distributed application 
(e.g. Spark executors),
+so they can authenticate to services.
+
+* A single token is used for authentication
+
+If Kerberos authentication were used, each client connection to a server would 
require a trip
+to the KDC and generation of a service ticket. In a distributed system, the 
number of service
+tickets can balloon pretty quickly when you think about the number of client 
processes (e.g. Spark
+executors) vs. the number of service processes (e.g. HDFS DataNodes). That 
generates unnecessary
+extra load on the KDC, and may even run into usage limits set up by the KDC 
admin.
+
+
+So in short, DTs are *not* Kerberos tokens. They are used by many services to 
replace Kerberos
+authentication, or even other forms of authentication, although there is 
nothing (aside from
+maybe implementation details) that ties them to Kerberos or any other 
authentication mechanism.
+
+
+## Lifecycle of DTs
+
+DTs, unlike Kerberos tokens, are service-specific. There is no centralized 
location you contact
+to create a DT for a service. So, the first step needed to get a DT is being 
able to authenticate
+to the service in question. In the Hadoop ecosystem, that is generally done 
using Kerberos.
+
+This requires Kerberos credentials to be available somewhere for the 
application to use. The user
+is generally responsible for providing those credentials, which is most 
commonly done by logging
+in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "token cache" 
containing a TGT
+(ticket granting ticket), which can then be used to request service tickets.
+
+There are other ways of obtaining TGTs, but, ultimately, you need a TGT to 
bootstrap the process.
+
+Once a TGT is available, the target service's client library can then be used 
to authenticate
+to the service using the Kerberos credentials, and request the creation of a 
delegation token.
+This token can now be sent to other processes and used to authenticate to 
different daemons
+belonging to that service.
+
+And thus the first drawback of DTs becomes apparent: you need service-specific 
logic to create and
+use them. While it would be possible to create a shared API or even a shared 
service to manage the
+creation and use of DTs, that doesn't currently exist, and retrofitting such a 
system would be a
+huge change in a bunch of different services.
+
+Spark works around this by having a (somewhat) pluggable, internal DT creation 
API. Support for new
+services can be added by implementing a `HadoopDelegationTokenProvider` that 
is then called by Spark
+when generating delegation tokens for an application. Spark distributes tokens 
to executors using
+the `UserGroupInformation` Hadoop API, and it's up to the DT provider and the 
respective client
 
 Review comment:
   You're thinking about writing the tokens to a file and automatically loading 
them, which is a YARN-only thing. Spark has its own mechanism to distribute 
tokens, which it uses even in YARN.
   
   The YARN token API is mostly used so YARN can use the app's tokens for log 
aggregation and stuff.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to