[ https://issues.apache.org/jira/browse/HADOOP-16130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797558#comment-16797558 ]
Daryn Sharp commented on HADOOP-16130:
--------------------------------------
The problem was an accumulation of tokens that were never cancelled. Jobs
would erroneously set an option telling the RM to never cancel tokens after
job completion. The unfounded worry was that tokens would be prematurely
cancelled if a job launched sub-jobs and exited before the sub-jobs completed.
Many years ago I added reference counting to tokens to avoid that very problem.

The Curator child recipes watch for and fetch/cache new secrets and tokens. As
the number of uncancelled tokens grew, so did the number of node watches and
the size of the node listings needed to detect changes (we had to increase the
response buffer!); ZK CPU load increased, quorum consistency suffered severe
latency, etc. The tipping point came when the propagation time across the
quorum exceeded the time needed to: request a KMS token, submit the job, have
the RM get a Kerberos TGS, and have the RM authenticate to the KMS. Once the
quorum is hundreds of milliseconds or more out of sync, one KMS rejects tokens
issued by another KMS in the bank.

That took 4 KMS servers and many hundreds of thousands of tokens. The internal
mitigation was completely disabling the RM's "don't cancel tokens" setting.
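
To illustrate the reference counting mentioned above, here is a minimal sketch
of the idea. It is not the RM's actual code: the class name TokenRefCounter and
its methods are hypothetical, and token ids are simplified to strings. A token
becomes eligible for cancellation only when the last job using it finishes, so
a parent job that exits before its sub-jobs does not cancel their token.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-token reference counting; not the RM's real classes. */
class TokenRefCounter {
    // token id -> number of running jobs that still use the token
    private final Map<String, Integer> refs = new HashMap<>();

    /** Record that one more job uses the token. */
    synchronized void acquire(String tokenId) {
        refs.merge(tokenId, 1, Integer::sum);
    }

    /**
     * Record that a job using the token has finished.
     *
     * @return true only when the last user released the token,
     *         i.e. the caller may now cancel it
     */
    synchronized boolean release(String tokenId) {
        Integer count = refs.get(tokenId);
        if (count == null) {
            return false; // unknown token: nothing to cancel
        }
        if (count == 1) {
            refs.remove(tokenId);
            return true;
        }
        refs.put(tokenId, count - 1);
        return false;
    }

    /** Number of tokens that still have at least one user. */
    synchronized int trackedTokens() {
        return refs.size();
    }
}
```

With a scheme like this, a parent job and its sub-job would each call
acquire() on the shared token; the parent's release() returns false, and only
the sub-job's later release() returns true and triggers the cancel.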
> Support delegation token operations in KMS Benchmark
> ----------------------------------------------------
>
> Key: HADOOP-16130
> URL: https://issues.apache.org/jira/browse/HADOOP-16130
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.3.0
> Reporter: Wei-Chiu Chuang
> Assignee: George Huang
> Priority: Major
>
> At the last Hadoop Contributors Meetup, [~daryn] shared that another KMS
> throughput bottleneck is ZooKeeper -- KMS uses ZK to store delegation tokens.
> ZK would be brought to a halt when expired delegation tokens are purged. That
> sounds critical, especially given that in most deployments KMS shares the
> same ZK quorum as HDFS, so it would cause a NameNode failover.
> The current KMS benchmark does not support delegation token operations
> (addDelegationTokens, cancelDelegationToken, renewDelegationToken), so it's
> hard to understand how bad it is, and hard to quantify the improvement of a
> fix.
> Filing this JIRA to support those operations before we move on to the fix
> for the ZK issue.
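
As a rough sketch of what benchmarking the three delegation token operations
named in the description could look like: everything below is hypothetical.
TokenOps is a stand-in interface, not the real Hadoop client API, and
InMemoryTokenOps is a stub with no KMS or ZK behind it. The sketch only shows
the shape of an add/renew/cancel loop and how a throughput figure could be
derived from it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a KMS client; not the real Hadoop signatures. */
interface TokenOps {
    String addDelegationToken(String renewer);
    long renewDelegationToken(String token);
    void cancelDelegationToken(String token);
}

/** In-memory stub so the loop can run without a KMS or ZK quorum. */
class InMemoryTokenOps implements TokenOps {
    private final Map<String, Long> expiryByToken = new HashMap<>();
    private long nextId = 0;

    public String addDelegationToken(String renewer) {
        String token = renewer + "-" + (nextId++);
        expiryByToken.put(token, System.currentTimeMillis() + 1000L);
        return token;
    }

    public long renewDelegationToken(String token) {
        Long old = expiryByToken.get(token);
        if (old == null) {
            throw new IllegalArgumentException("unknown token: " + token);
        }
        long renewed = old + 1000L; // extend the expiry by a fixed interval
        expiryByToken.put(token, renewed);
        return renewed;
    }

    public void cancelDelegationToken(String token) {
        expiryByToken.remove(token);
    }

    /** Tokens that were added but not yet cancelled. */
    int liveTokens() {
        return expiryByToken.size();
    }
}

class TokenBench {
    /** Runs add -> renew -> cancel for each iteration; returns operations per second. */
    static double run(TokenOps ops, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            String token = ops.addDelegationToken("rm");
            ops.renewDelegationToken(token);
            ops.cancelDelegationToken(token);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return (3.0 * iterations) / (elapsedNanos / 1e9);
    }
}
```

Calling TokenBench.run(new InMemoryTokenOps(), 1000) exercises all three
operations; a real benchmark would swap the stub for a client that talks to an
actual KMS, and could skip the cancel step to reproduce token accumulation.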