[
https://issues.apache.org/jira/browse/HADOOP-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524379#comment-15524379
]
Xiao Chen commented on HADOOP-13652:
------------------------------------
Thanks for creating this, [~axenol]. Agreed that the connection issue should be
tracked in a separate jira.
It looks to me like the curator client is always created by
{{ZKSignerSecretProvider}}, so the block that reads the configs in
{{ZKDelegationTokenSecretManager}} is never executed, i.e.
{{CURATOR_TL.get() != null}} is always true.
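To make the suspected control flow concrete, here is a minimal, self-contained sketch of the pattern. It is not the Hadoop code: {{Client}}, {{CLIENT_TL}}, {{resolveClient}} and the two property keys are hypothetical stand-ins, and only the shape (a thread-local that, when already set, short-circuits the branch that reads the configured timeouts) is taken from the description above.

```java
import java.util.Properties;

public class ThreadLocalClientSketch {
    /** Hypothetical stand-in for a curator client; it only records its timeouts. */
    static final class Client {
        final int sessionTimeoutMs;
        final int connectionTimeoutMs;

        Client(int sessionTimeoutMs, int connectionTimeoutMs) {
            this.sessionTimeoutMs = sessionTimeoutMs;
            this.connectionTimeoutMs = connectionTimeoutMs;
        }
    }

    // Plays the role of CURATOR_TL: a per-thread slot that an outer component may pre-fill.
    static final ThreadLocal<Client> CLIENT_TL = new ThreadLocal<>();

    static void setClient(Client client) {
        CLIENT_TL.set(client);
    }

    /** Returns the pre-set client if present; otherwise builds one from the config. */
    static Client resolveClient(Properties conf) {
        Client preset = CLIENT_TL.get();
        if (preset != null) {
            // The configured timeouts below are never consulted on this path.
            return preset;
        }
        // Hypothetical config keys, with 10000 ms fallbacks like the DTSM defaults.
        int session = Integer.parseInt(conf.getProperty("session.timeout.ms", "10000"));
        int connection = Integer.parseInt(conf.getProperty("connection.timeout.ms", "10000"));
        return new Client(session, connection);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("session.timeout.ms", "10000");
        conf.setProperty("connection.timeout.ms", "10000");

        // An outer component injects a client that was built with library defaults.
        setClient(new Client(60000, 15000));
        Client injected = resolveClient(conf);
        System.out.println(injected.sessionTimeoutMs + " / " + injected.connectionTimeoutMs);

        // Once the slot is cleared, the config-reading branch runs.
        setClient(null);
        Client own = resolveClient(conf);
        System.out.println(own.sessionTimeoutMs + " / " + own.connectionTimeoutMs);
    }
}
```

Running {{main}} prints {{60000 / 15000}} and then {{10000 / 10000}}: the configured values only take effect when nothing has been injected beforehand.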
It also seems that
[ZKSSP|https://github.com/apache/hadoop/blob/4815d024c59cb029e2053d94c7aed33eb8053d3e/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/util/ZKSignerSecretProvider.java#L360]
and
[ZKDTSM|https://github.com/apache/hadoop/blob/4815d024c59cb029e2053d94c7aed33eb8053d3e/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/ZKDelegationTokenSecretManager.java#L205]
create the curator client using different default retry policies. [~asuresh],
do you know the background on this? Is the difference intentional?
> ZKDelegationTokenSecretManager doesn't seem to honor ZK connection/session
> timeouts
> -----------------------------------------------------------------------------------
>
> Key: HADOOP-13652
> URL: https://issues.apache.org/jira/browse/HADOOP-13652
> Project: Hadoop Common
> Issue Type: Bug
> Components: kms
> Reporter: Alex Ivanov
>
> Judging by some of the errors I've seen from KMS during ZooKeeper connection
> issues, the following timeouts do not seem to be picked up.
> {code}
> package org.apache.hadoop.security.token.delegation;
> public abstract class ZKDelegationTokenSecretManager<TokenIdent extends
> AbstractDelegationTokenIdentifier>
> extends AbstractDelegationTokenSecretManager<TokenIdent> {
> public static final int ZK_DTSM_ZK_SESSION_TIMEOUT_DEFAULT = 10000;
> public static final int ZK_DTSM_ZK_CONNECTION_TIMEOUT_DEFAULT = 10000;
> ...
> }
> {code}
> Instead, the connection and session timeouts in effect are 15 and 60 seconds
> respectively, which are the curator defaults.
> {code}
> package org.apache.curator.framework;
> public class CuratorFrameworkFactory
> {
> private static final int DEFAULT_SESSION_TIMEOUT_MS =
> Integer.getInteger("curator-default-session-timeout", 60 * 1000);
> private static final int DEFAULT_CONNECTION_TIMEOUT_MS =
> Integer.getInteger("curator-default-connection-timeout", 15 * 1000);
> ...
> }
> {code}
> It looks like DelegationTokenAuthenticationFilter hands
> ZKDelegationTokenSecretManager an already-built curator client, and that may
> be the cause:
> {code}
> package org.apache.hadoop.security.token.delegation.web;
> public class DelegationTokenAuthenticationFilter
> extends AuthenticationFilter {
> protected void initializeAuthHandler(String authHandlerClassName,
> FilterConfig filterConfig) throws ServletException {
> ZKDelegationTokenSecretManager.setCurator((CuratorFramework)
> filterConfig.getServletContext().getAttribute(ZKSignerSecretProvider.
> ZOOKEEPER_SIGNER_SECRET_PROVIDER_CURATOR_CLIENT_ATTRIBUTE));
> super.initializeAuthHandler(authHandlerClassName, filterConfig);
> ZKDelegationTokenSecretManager.setCurator(null);
> }
> ...
> }
> {code}
> Example errors:
> {code}
> 2016-09-25 01:46:33,053 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
> 2016-09-25 01:46:33,053 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
> 2016-09-25 01:46:34,028 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (15976)
> 2016-09-25 01:46:34,053 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (16001)
> 2016-09-25 01:46:37,053 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (19001)
> 2016-09-25 01:46:40,053 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (22001)
> 2016-09-25 01:46:49,055 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (31003)
> 2016-09-25 01:46:52,029 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (33977)
> 2016-09-25 01:47:05,344 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (47292)
> 2016-09-25 01:47:09,345 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (51292)
> 2016-09-25 01:47:24,346 WARN ConnectionState - Connection attempt
> unsuccessful after 66294 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> 2016-09-25 01:47:43,740 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
> 2016-09-25 01:47:43,740 ERROR ConnectionState - Connection timed out for
> connection string (host1, host2, host3) and timeout (15000) / elapsed (15001)
> {code}
> There are also some connection issues between KMS and ZooKeeper. They are
> sporadic, which is why I'm still trying to pinpoint them, but essentially KMS
> can get stuck in a connect/disconnect cycle with ZooKeeper, from which it
> either eventually recovers on its own or is freed by a restart. I'm linking
> the relevant jira in case it is related to this issue.
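The {{CuratorFrameworkFactory}} snippet quoted above resolves its defaults with {{Integer.getInteger}}, which reads a JVM system property and falls back to the given value. The stdlib-only sketch below reuses the two property names and fallbacks from that snippet to show the lookup behaviour; the class and method names are hypothetical. Because the quoted fields are {{static final}}, the real lookup would happen once at class initialization, so the properties would presumably need to be set (for example with {{-D}} on the JVM command line) before that class loads. Whether this is a suitable workaround for KMS is an assumption and has not been verified in this thread.

```java
public class CuratorDefaultsSketch {
    // Same property name and fallback as the quoted DEFAULT_SESSION_TIMEOUT_MS.
    static int sessionTimeoutMs() {
        return Integer.getInteger("curator-default-session-timeout", 60 * 1000);
    }

    // Same property name and fallback as the quoted DEFAULT_CONNECTION_TIMEOUT_MS.
    static int connectionTimeoutMs() {
        return Integer.getInteger("curator-default-connection-timeout", 15 * 1000);
    }

    public static void main(String[] args) {
        // With neither property set, the fallbacks (60000 and 15000) are returned.
        System.out.println(sessionTimeoutMs() + " / " + connectionTimeoutMs());

        // Setting the properties changes what the same lookups return.
        System.setProperty("curator-default-session-timeout", "10000");
        System.setProperty("curator-default-connection-timeout", "10000");
        System.out.println(sessionTimeoutMs() + " / " + connectionTimeoutMs());
    }
}
```

The 15000 and 60000 values that appear in the example errors above match these fallbacks.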