vanzin commented on a change in pull request #23348: [SPARK-25857][core] Add developer documentation regarding delegation tokens. URL: https://github.com/apache/spark/pull/23348#discussion_r245743746
########## File path: core/src/main/scala/org/apache/spark/deploy/security/README.md ##########

@@ -0,0 +1,238 @@

# Delegation Token Handling In Spark

This document aims to explain and demystify delegation tokens as they are used by Spark, since
this topic is generally a huge source of confusion.


## What are delegation tokens?

Delegation tokens (DTs from now on) are authentication tokens used by some services to replace
Kerberos service tickets. Many services in the Hadoop ecosystem support DTs, since they have two
very desirable advantages over Kerberos tickets:

* No need to distribute Kerberos credentials

In a distributed application, distributing Kerberos credentials is tricky. Not all users have
keytabs, and when they do, it's generally frowned upon to distribute them over the network as
part of application data.

DTs allow a single place (e.g. the Spark driver) to require Kerberos credentials. That entity
can then distribute the DTs to other parts of the distributed application (e.g. Spark executors),
so they can authenticate to services.

* A single token is used for authentication

If Kerberos authentication were used, each client connection to a server would require a trip
to the KDC and generation of a service ticket. In a distributed system, the number of service
tickets can balloon pretty quickly when you consider the number of client processes (e.g. Spark
executors) vs. the number of service processes (e.g. HDFS DataNodes). That generates unnecessary
extra load on the KDC, and may even run into usage limits set up by the KDC admin.


So in short, DTs are *not* Kerberos tokens. They are used by many services to replace Kerberos
authentication, or even other forms of authentication, although there is nothing (aside from
maybe implementation details) that ties them to Kerberos or any other authentication mechanism.


## Lifecycle of DTs

DTs, unlike Kerberos tickets, are service-specific: there is no centralized location you contact
to create a DT for a service; each service issues its own. So, the first step needed to get a DT
is being able to authenticate to the service in question. In the Hadoop ecosystem, that is
generally done using Kerberos.

This requires Kerberos credentials to be available somewhere for the application to use. The user
is generally responsible for providing those credentials, which is most commonly done by logging
in to the KDC (e.g. using "kinit"). That generates a (Kerberos) "ticket cache" containing a TGT
(ticket granting ticket), which can then be used to request service tickets.

There are other ways of obtaining TGTs, but, ultimately, you need a TGT to bootstrap the process.

Once a TGT is available, the target service's client library can then be used to authenticate
to the service using the Kerberos credentials, and request the creation of a delegation token.
This token can now be sent to other processes and used to authenticate to different daemons
belonging to that service.

And thus the first drawback of DTs becomes apparent: you need service-specific logic to create
and use them. While it would be possible to create a shared API or even a shared service to
manage the creation and use of DTs, no such thing exists today, and retrofitting it would mean
invasive changes to many different services.

Spark works around this by having a (somewhat) pluggable, internal DT creation API. Support for
new services can be added by implementing a `HadoopDelegationTokenProvider` that is then called
by Spark when generating delegation tokens for an application. Spark distributes tokens to
executors using the `UserGroupInformation` Hadoop API, and it's up to the DT provider and the
respective client library to agree on how to use those tokens.
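As a concrete illustration of what such a provider does, here is a minimal sketch for an
HDFS-like service. The class and method names are hypothetical (the real
`HadoopDelegationTokenProvider` trait is internal to Spark and its exact signatures vary between
versions), but the Hadoop calls inside are the standard ones a provider for HDFS would use.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.Credentials

// Hypothetical provider-like class; Spark's real provider trait has different signatures.
class ExampleHdfsTokenProvider {
  // Called (conceptually) when Spark gathers tokens for an application. The process must
  // already hold valid Kerberos credentials (a TGT or a keytab login) at this point.
  def obtainTokens(hadoopConf: Configuration, renewer: String, creds: Credentials): Unit = {
    val fs = FileSystem.get(hadoopConf)
    // Standard Hadoop API: authenticates to the NameNode with the current Kerberos
    // credentials, requests delegation tokens, and stores them in `creds`.
    fs.addDelegationTokens(renewer, creds)
  }
}
```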
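The distribution side builds on Hadoop's standard `Credentials` serialization and the
`UserGroupInformation` (UGI) API. The sketch below is a simplified view of that flow using
hypothetical helper names; Spark's actual plumbing lives in its internal driver and executor code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object TokenShippingSketch {
  // Driver side: serialize the collected tokens so they can be shipped to executors
  // (in Spark's case, over its internal RPC).
  def serializeTokens(creds: Credentials): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    creds.writeTokenStorageToStream(out)
    out.close()
    buffer.toByteArray
  }

  // Executor side: merge the received tokens into the current user's UGI, so that service
  // client libraries (HDFS, HBase, ...) find them when they need to authenticate.
  def addTokensToCurrentUser(tokenBytes: Array[Byte]): Unit = {
    val creds = new Credentials()
    creds.readTokenStorageStream(new DataInputStream(new ByteArrayInputStream(tokenBytes)))
    UserGroupInformation.getCurrentUser.addCredentials(creds)
  }
}
```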
Once they are created, the semantics of how DTs operate are also service-specific. But, in
general, they try to follow the semantics of Kerberos tickets:

* A "lifetime": how long the DT is valid before it needs to be renewed.
* A "renewable life": the total time during which the DT can be renewed.

Once the token reaches the end of its "renewable life", a new one needs to be created by
contacting the appropriate service, restarting the above process.


## DT Renewal, Renewers, and YARN

This is the most confusing part of DT handling, partly because much of the system was designed
with MapReduce, and later YARN, in mind.

As seen above, DTs need to be renewed periodically until they finally expire for good. An example
of this is the default configuration of HDFS services: delegation tokens are valid for up to 7
days, and need to be renewed every 24 hours. If 24 hours pass without the token being renewed,
it cannot be used anymore. And after 7 days, the token cannot be renewed at all.

This raises the question: who renews tokens? And for a long time the answer was YARN.

When YARN applications are submitted, a set of DTs is also submitted with them. YARN takes care
of distributing these tokens to containers (using conventions set by the `UserGroupInformation`
API) and, also, of keeping them renewed while the app is running. These tokens are used not just
by the application; they are also used by YARN itself to implement features like log collection
and aggregation.

But this has a few caveats.


1. Who renews the tokens?

In the case of YARN, this is handled mostly transparently by the Hadoop libraries. Some services
have the concept of a token "renewer": the name of the principal that is allowed to renew the DT.
When submitting to YARN, that will be the principal that the YARN service is running as, which
means that the client application needs to know that information.

For other resource managers, the renewer mostly does not matter, since there is no service doing
the renewal. Except that it sometimes leaks into library code, such as in SPARK-20328.


2. What tokens are renewed?

This is probably the biggest caveat.

As discussed in the previous section, DTs are service-specific, and require service-specific
libraries for creation *and* renewal. This means that for YARN to be able to renew application
tokens, YARN needs:

* The client libraries for all the services the application is using
* Information about how to connect to the services the application is using
* Permissions to connect to those services

In reality, though, most of the time YARN has access to a single HDFS cluster, and that will be
the extent of its DT renewal features. Any other tokens sent to YARN will be distributed to

Review comment: YARN does not have Hive client libraries in its classpath, so how can it even
talk to the HMS or HS2 at all?
