Wei-Chiu Chuang created SPARK-37329:
---------------------------------------

             Summary: File system delegation tokens are leaked
                 Key: SPARK-37329
                 URL: https://issues.apache.org/jira/browse/SPARK-37329
             Project: Spark
          Issue Type: Bug
          Components: Security, YARN
    Affects Versions: 2.4.0
            Reporter: Wei-Chiu Chuang


On a very busy Hadoop cluster (with HDFS at-rest encryption), we found that KMS 
accumulated millions of delegation tokens that were not cancelled even after 
their jobs had finished, and KMS runs out of memory within a day because of the 
delegation token leak.

We were able to reproduce the bug in a smaller test cluster and found that when 
a Spark job starts, it acquires two delegation tokens, and only one is 
cancelled properly after the job finishes. The other one is left over and 
lingers for up to 7 days (the default Hadoop delegation token lifetime).

YARN handles the lifecycle of a delegation token properly if its renewer is 
'yarn'. However, Spark intentionally (a hack?) acquires a second delegation 
token with the job issuer as the renewer, simply to get the token renewal 
interval. The token is then ignored but not cancelled.

Proposal: cancel the second delegation token immediately after the token 
renewal interval is obtained.
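
The proposed fix can be sketched roughly as follows. This is only an 
illustration of the "probe, then cancel" pattern, not Spark's actual code: 
RenewableToken and TokenRenewalIntervalSketch are hypothetical stand-ins 
invented for this example, and the real Hadoop token calls additionally throw 
checked exceptions that a real patch would have to handle.

```java
// Hypothetical stand-in for a delegation token. This is NOT the Hadoop API;
// it only models the three operations the sketch needs.
interface RenewableToken {
    long issueDateMillis();   // when the token was issued
    long renew();             // renews the token, returns the new expiration time (ms)
    void cancel();            // invalidates the token on the issuing service
}

final class TokenRenewalIntervalSketch {
    private TokenRenewalIntervalSketch() {}

    /**
     * Uses a throwaway token to learn the renewal interval, then cancels it
     * so it does not linger on the token service until its maximum lifetime.
     */
    static long renewalIntervalMillis(RenewableToken probeToken) {
        try {
            long newExpiration = probeToken.renew();
            return newExpiration - probeToken.issueDateMillis();
        } finally {
            // Cancel even if renew() failed: the token is never used again.
            probeToken.cancel();
        }
    }
}
```

The try/finally is the point of the sketch: the probe token is cancelled on 
both the success and the failure path, so no token is left behind once the 
interval has been computed.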

Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably 
been present since day 1.



