Wei-Chiu Chuang created SPARK-37329:
---------------------------------------
Summary: File system delegation tokens are leaked
Key: SPARK-37329
URL: https://issues.apache.org/jira/browse/SPARK-37329
Project: Spark
Issue Type: Bug
Components: Security, YARN
Affects Versions: 2.4.0
Reporter: Wei-Chiu Chuang
On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that KMS
accumulated millions of delegation tokens that were not cancelled even after
jobs finished, and KMS ran out of memory within a day because of the
delegation token leak.
We were able to reproduce the bug on a smaller test cluster, and realized that
when a Spark job starts, it acquires two delegation tokens, and only one is
cancelled properly after the job finishes. The other one is left over and
lingers for up to 7 days (the default Hadoop delegation token lifetime).
YARN handles the lifecycle of a delegation token properly if its renewer is
'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
token with the job issuer as the renewer, simply to get the token renewal
interval. The token is then ignored but not cancelled.
Proposal: cancel the second delegation token immediately after the token
renewal interval is obtained.
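The proposal can be illustrated with a small self-contained model. This is only a sketch, not Spark or Hadoop code: `FakeTokenService` and `get_renewal_interval_ms` are hypothetical names invented for illustration, and the in-memory service merely stands in for a token issuer such as KMS. The point it shows is that the temporary token is cancelled in a `finally` block, so it is released whether or not the renewal call succeeds.

```python
class FakeTokenService:
    """Hypothetical in-memory stand-in for a token issuer such as KMS."""

    def __init__(self, renew_interval_ms):
        self.renew_interval_ms = renew_interval_ms
        self.live_tokens = {}  # token id -> token metadata
        self._next_id = 0

    def issue(self, renewer, issue_date_ms):
        """Issue a new token and remember it until it is cancelled."""
        self._next_id += 1
        self.live_tokens[self._next_id] = {
            "renewer": renewer,
            "issue_date_ms": issue_date_ms,
        }
        return self._next_id

    def renew(self, token_id, now_ms):
        """Renew a live token and return its new expiration time."""
        if token_id not in self.live_tokens:
            raise KeyError("unknown or cancelled token")
        return now_ms + self.renew_interval_ms

    def cancel(self, token_id):
        """Forget the token; cancelling twice is harmless."""
        self.live_tokens.pop(token_id, None)


def get_renewal_interval_ms(service, user, now_ms):
    """Learn the renewal interval from a throwaway token, then cancel it."""
    token_id = service.issue(renewer=user, issue_date_ms=now_ms)
    try:
        issue_date_ms = service.live_tokens[token_id]["issue_date_ms"]
        new_expiration_ms = service.renew(token_id, now_ms)
        return new_expiration_ms - issue_date_ms
    finally:
        # Without this cancel, the throwaway token would stay in the
        # service until it expired on its own.
        service.cancel(token_id)


service = FakeTokenService(renew_interval_ms=24 * 60 * 60 * 1000)
yarn_token = service.issue(renewer="yarn", issue_date_ms=0)
interval_ms = get_renewal_interval_ms(service, user="alice", now_ms=0)
```

In this model only the token whose renewer is 'yarn' remains live after the call; the second token exists just long enough to read the interval.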
Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably
been present since day 1.