[ https://issues.apache.org/jira/browse/HDFS-15383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17247035#comment-17247035 ]
Fengnan Li commented on HDFS-15383:
-----------------------------------
[~John Smith] It is a good question.
First of all, when a token goes stale it is deleted by the cleanup thread, so
when a client accesses this Router with a renewed token, the Router does not
recognize it and loads it from ZK. The default scan interval is 1 hour, which
is long.
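To illustrate the fallback path, here is a minimal sketch of a lookup that
checks Router memory first and goes to the backing store only on a miss. The
names TokenLookupSketch, RemoteTokenStore, fetchRenewDate and removeExpired
are hypothetical and are not the classes touched by the patch; a token is
reduced to an id plus a renew date to keep the sketch short.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class TokenLookupSketch {
  /** Hypothetical stand-in for the ZooKeeper-backed token store. */
  interface RemoteTokenStore {
    Optional<Long> fetchRenewDate(String tokenId);
  }

  // Router-local view: token id -> renew date in epoch millis.
  private final Map<String, Long> localCache = new ConcurrentHashMap<>();
  private final RemoteTokenStore remote;

  TokenLookupSketch(RemoteTokenStore remote) {
    this.remote = remote;
  }

  /** What the cleanup thread does: drop entries whose renew date has passed. */
  void removeExpired(long nowMillis) {
    localCache.values().removeIf(renewDate -> renewDate < nowMillis);
  }

  /** Serve from memory when possible; on a miss, load from the store and cache it. */
  Optional<Long> getRenewDate(String tokenId) {
    Long cached = localCache.get(tokenId);
    if (cached != null) {
      return Optional.of(cached);
    }
    Optional<Long> loaded = remote.fetchRenewDate(tokenId);
    loaded.ifPresent(renewDate -> localCache.put(tokenId, renewDate));
    return loaded;
  }
}
```

With this shape, a token that was cleaned up locally but renewed elsewhere
costs one extra read from the store instead of an auth failure.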
On the other hand, clients normally renew a token before it expires. For
example, Yarn renews a token when it reaches 92% (configurable; I forget the
exact value) of the renew date, meaning that when the client renews the token
there is still over 1 hour left before it expires. Internally we set our
sync interval to 10 minutes, so all Routers get the new renew date within
about 10 minutes. In the meantime the token is still valid, though different
Routers may hold different renew dates for it.
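The margin can be checked with a little arithmetic. The sketch below assumes
a 24-hour renew interval (an assumption, not stated above) and takes the 92%
figure as given, even though the exact value is uncertain. The class and
method names are hypothetical.

```java
import java.time.Duration;

class RenewalMarginSketch {
  /** Time left on the token at the moment the client renews it. */
  static Duration remainingAtRenewal(Duration renewInterval, double renewFraction) {
    long elapsedMillis = (long) (renewInterval.toMillis() * renewFraction);
    return renewInterval.minusMillis(elapsedMillis);
  }

  /** True if a Router polling at syncInterval sees the renewal before the old renew date passes. */
  static boolean syncsBeforeExpiry(Duration remaining, Duration syncInterval) {
    return syncInterval.compareTo(remaining) < 0;
  }

  public static void main(String[] args) {
    Duration renewInterval = Duration.ofHours(24); // assumption: 24h renew interval
    double renewFraction = 0.92;                   // figure quoted above, exact value uncertain
    Duration remaining = remainingAtRenewal(renewInterval, renewFraction);

    System.out.println("Minutes left at renewal: " + remaining.toMinutes());
    System.out.println("10 min sync is safe: "
        + syncsBeforeExpiry(remaining, Duration.ofMinutes(10)));
    System.out.println("2 h sync is safe: "
        + syncsBeforeExpiry(remaining, Duration.ofHours(2)));
  }
}
```

Under these assumptions the token has a bit under two hours left when it is
renewed, so a 10-minute sync interval fits comfortably and a 2-hour one does
not.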
In our environment, 10 minutes is the time it takes to load 1M tokens from ZK
into Router memory.
So theoretically your client will fail if you set the sync interval to a very
large value such as 2 hours, but we don't use such a large value in this poll
model. We can also shorten the deletion period, for example to every 15
minutes, to further prevent auth failures.
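For reference, the poll model can be sketched roughly as follows. This is
not the code in the patch: PeriodicTokenSyncSketch, snapshotLoader and
syncOnce are hypothetical names, and the loader stands in for whatever reads
the full token set from ZK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class PeriodicTokenSyncSketch {
  // Router-local view: token id -> renew date in epoch millis.
  private final Map<String, Long> localCache = new ConcurrentHashMap<>();
  // Hypothetical loader that returns every token currently in the store.
  private final Supplier<Map<String, Long>> snapshotLoader;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicTokenSyncSketch(Supplier<Map<String, Long>> snapshotLoader) {
    this.snapshotLoader = snapshotLoader;
  }

  /** One poll: pick up new and renewed tokens, then drop tokens gone from the store. */
  void syncOnce() {
    Map<String, Long> snapshot = snapshotLoader.get();
    localCache.putAll(snapshot);
    localCache.keySet().retainAll(snapshot.keySet());
  }

  /** Poll on a fixed delay instead of registering one watch per token. */
  void start(long syncIntervalMinutes) {
    scheduler.scheduleWithFixedDelay(
        this::syncOnce, 0, syncIntervalMinutes, TimeUnit.MINUTES);
  }

  void stop() {
    scheduler.shutdownNow();
  }

  Long renewDateOf(String tokenId) {
    return localCache.get(tokenId);
  }
}
```

A fixed delay rather than a fixed rate is used here so that a slow load
cannot cause polls to pile up; that is a choice made for the sketch.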
Hope it makes sense.
> RBF: Disable watch in ZKDelegationSecretManager for performance
> ---------------------------------------------------------------
>
> Key: HDFS-15383
> URL: https://issues.apache.org/jira/browse/HDFS-15383
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Fengnan Li
> Assignee: Fengnan Li
> Priority: Major
> Fix For: 3.4.0
>
>
> Based on the current design for delegation tokens in the secure Router, the
> total number of watches for tokens is the product of the number of routers
> and the number of tokens. This is because ZKDelegationTokenManager uses
> PathChildrenCache from Curator, which automatically sets the watch, and ZK
> pushes the sync information to each router. Some evaluations indicate that a
> large number of watches in ZooKeeper has a negative performance impact on the
> ZooKeeper server. In our practice, when the number of watches exceeds 1.2
> million in a single ZK server there is significant ZK performance
> degradation. Thus this ticket is to rewrite ZKDelegationTokenManagerImpl.java
> to explicitly disable the PathChildrenCache and have Routers sync
> periodically from ZooKeeper. This has been working fine at the scale of 10
> Routers with 2 million tokens.