[
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444890#comment-17444890
]
Wei-Chiu Chuang commented on SPARK-37329:
-----------------------------------------
I should also note that this affects not just KMS, but any file system
implementation (HDFS, Ozone, perhaps S3) with delegation token support.
> File system delegation tokens are leaked
> ----------------------------------------
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
> Issue Type: Bug
> Components: Security, YARN
> Affects Versions: 2.4.0
> Reporter: Wei-Chiu Chuang
> Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that KMS
> had accumulated millions of delegation tokens that were not cancelled even
> after jobs had finished, and KMS ran out of memory within a day because of the
> delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized that
> when a Spark job starts, it acquires two delegation tokens, and only one is
> cancelled properly after the job finishes. The other one is left over and
> lingers for up to 7 days (the default Hadoop delegation token lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
> token with the job issuer as the renewer, simply to get the token renewal
> interval. The token is then ignored but not cancelled.
> Proposed fix: cancel the second delegation token immediately after the token
> renewal interval is obtained.
> Environment: CDH6.3.2 (based on Apache Spark 2.4.0), but the bug has probably
> been present since day 1.
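
To make the leak and the proposed fix concrete, here is a toy model in Python. It is not Spark or Hadoop code: TokenService, run_job, leaked_after, the user name, and the one-day renewal interval are all made up for this sketch. Only the 7-day lifetime and the acquire/renew/cancel sequence come from the description above.

```python
import itertools
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class Token:
    token_id: int
    renewer: str
    issue_date_ms: int
    max_date_ms: int


class TokenService:
    """Hypothetical stand-in for a token issuer such as KMS or a NameNode."""

    RENEW_INTERVAL_MS = 1 * DAY_MS  # assumed value, for illustration only
    MAX_LIFETIME_MS = 7 * DAY_MS    # 7-day lifetime, as in the description

    def __init__(self):
        self._next_id = itertools.count(1)
        self.live = {}  # token_id -> Token; this is what grows when tokens leak

    def issue(self, renewer, now_ms):
        token = Token(next(self._next_id), renewer, now_ms,
                      now_ms + self.MAX_LIFETIME_MS)
        self.live[token.token_id] = token
        return token

    def renew(self, token, now_ms):
        """Return the new expiry time; never past the token's max lifetime."""
        if token.token_id not in self.live:
            raise KeyError("token was cancelled or never issued")
        return min(now_ms + self.RENEW_INTERVAL_MS, token.max_date_ms)

    def cancel(self, token):
        self.live.pop(token.token_id, None)


def run_job(service, user, now_ms, cancel_probe_token):
    """Model one job: two tokens are acquired, as described in the issue."""
    # Token whose renewer is 'yarn'; YARN manages its lifecycle.
    yarn_token = service.issue("yarn", now_ms)
    # Second token, obtained only to learn the renewal interval.
    probe_token = service.issue(user, now_ms)
    interval_ms = service.renew(probe_token, now_ms) - probe_token.issue_date_ms
    if cancel_probe_token:
        # Proposed fix: cancel as soon as the interval is known.
        service.cancel(probe_token)
    # ... job runs and finishes; YARN cancels the token it manages.
    service.cancel(yarn_token)
    return interval_ms


def leaked_after(jobs, cancel_probe_token):
    """Count the tokens still held by the service after `jobs` jobs finish."""
    service = TokenService()
    for i in range(jobs):
        run_job(service, "alice", i * 1000, cancel_probe_token)
    return len(service.live)
```

In this model, leaked_after(n, False) leaves one token behind per job, while leaked_after(n, True) leaves none. The real issuer would still expire leftover tokens after the maximum lifetime, but on a busy cluster that is long enough for millions to pile up.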