[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...

vanzin Mon, 16 Oct 2017 17:04:48 -0700

GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/19509


    [SPARK-22290][core] Avoid creating Hive delegation tokens when not necessary.

    Hive delegation tokens are only needed when the Spark driver has no access
    to the Kerberos TGT. That happens in only two situations:
    
    - when using a proxy user
    - when using cluster mode without a keytab
    
    This change modifies the Hive provider so that it only generates delegation
    tokens in those situations, and tweaks the YARN AM so that it makes the
    proper user visible to the Hive code when running with keytabs, so that the
    TGT can be used instead of a delegation token.
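
The rule described above can be sketched as a small predicate. This is an illustrative Python sketch only, not the code in the Hive provider: the function name `hive_delegation_tokens_required` and its parameters are hypothetical, and it models just the two situations listed in this description (proxy user, or cluster mode without a keytab).

```python
def hive_delegation_tokens_required(is_proxy_user: bool,
                                    deploy_mode: str,
                                    has_keytab: bool) -> bool:
    """Return True when the driver cannot rely on the Kerberos TGT.

    Hypothetical helper: it covers only the two situations named in the
    description and ignores any other checks a real provider might make.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")

    # In cluster mode the driver runs remotely; without a keytab it has no
    # way to obtain a TGT there, so it needs a delegation token.
    cluster_without_keytab = deploy_mode == "cluster" and not has_keytab

    # A proxy user never has a TGT of its own.
    return is_proxy_user or cluster_without_keytab
```

Under these assumptions, client mode without a proxy user and cluster mode with a keytab both return False, which corresponds to the cases where the TGT is used instead of a delegation token.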
    
    The effect of this change is that now it's possible to initialize multiple,
    non-concurrent SparkContext instances in the same JVM. Before, the second
    invocation would fail to fetch a new Hive delegation token, which then could
    make the second (or third or...) application fail once the token expired.
    With this change, the TGT will be used to authenticate to the HMS instead.
    
    This change also avoids polluting the current logged-in user's credentials
    when launching applications. The credentials are copied only when running
    applications as a proxy user. This makes it possible to implement
    SPARK-11035 later, where multiple threads might be launching applications,
    and each app should have its own set of credentials.
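
One way to picture the credential handling described above is the following Python sketch. It is a simplified model under stated assumptions, not Spark or Hadoop code: the `User` class, the `prepare_app_credentials` function, and the token dictionaries are all hypothetical stand-ins. The assumption is that freshly fetched tokens go into a per-application credential set, and are copied back into the launching user's credentials only when that user is a proxy user.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class User:
    """Hypothetical stand-in for the current logged-in user."""
    name: str
    is_proxy: bool = False
    tokens: Dict[str, str] = field(default_factory=dict)


def prepare_app_credentials(user: User, fetched: Dict[str, str]) -> Dict[str, str]:
    """Build a credential set that belongs to one application only."""
    # Start from a private copy so later changes never leak between apps.
    app_tokens = dict(user.tokens)
    app_tokens.update(fetched)

    # Assumption: only a proxy user, which has no TGT to fall back on,
    # gets the new tokens copied into its own credentials.
    if user.is_proxy:
        user.tokens.update(fetched)

    return app_tokens
```

In this model a regular user's credentials are left untouched after a launch, so two applications started from different threads would each hold their own token set.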
    
    Tested by verifying HDFS and Hive access in the following scenarios:
    - client and cluster mode
    - client and cluster mode with proxy user
    - client and cluster mode with principal / keytab
    - long-running cluster app with principal / keytab
    - PySpark app that creates (and stops) multiple SparkContext instances
      through its lifetime


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-22290

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19509
    
----
commit 95a9658043c86187cd9143923d0c1307df449004
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2017-10-16T22:28:58Z

    [SPARK-22290][core] Avoid creating Hive delegation tokens when not necessary.
    

----

