[
https://issues.apache.org/jira/browse/HADOOP-13653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528459#comment-15528459
]
Alex Ivanov commented on HADOOP-13653:
--------------------------------------
[~xiaochen], thank you for your comments. The ZooKeeper (ZK) cluster seemed
healthy, and nothing in any of the ZooKeeper logs indicated loss of quorum or
random disconnects.
Instead, it seems the ZK connections became unstable after a large number of
KMS delegation tokens (>160,000) had accumulated. I'm not sure how this caused
the issue, but once we manually deleted the tokens, the disconnects stopped.
Once we apply the patch you provided for
[HADOOP-13487|https://issues.apache.org/jira/browse/HADOOP-13487] (thank you!),
I expect we'll be able to better manage the number of delegation tokens in ZK.
I do wish we were able to control some of the Curator parameters, so that we
could adjust the timeouts to our needs and curtail the repetitive error logging
when a disconnect happens - these logs have taken up to 70GB of space per day,
which turns viewing a single log into a big data problem.
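
To illustrate the kind of log curtailing I have in mind, here is a minimal
sketch in plain Java. It is not Curator's or Hadoop's API: the class name
{{LogThrottle}}, the {{shouldLog}} method and the message key are all
hypothetical, and the 60-second interval in the usage note is only an example
value. The idea is to let a repeated message through at most once per interval.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical helper: lets a given message key through at most once per interval. */
final class LogThrottle {
    private final long intervalMs;
    private final LongSupplier clock; // injectable clock, e.g. System::currentTimeMillis
    private final ConcurrentHashMap<String, Long> lastEmitted = new ConcurrentHashMap<>();

    LogThrottle(long intervalMs, LongSupplier clock) {
        this.intervalMs = intervalMs;
        this.clock = clock;
    }

    /** Returns true if the caller should log now, false if the message is suppressed. */
    boolean shouldLog(String key) {
        long now = clock.getAsLong();
        boolean[] emit = {false};
        // compute() is atomic per key, so concurrent callers cannot both win the same interval.
        lastEmitted.compute(key, (k, last) -> {
            if (last == null || now - last >= intervalMs) {
                emit[0] = true;
                return now;   // remember when this key was last allowed through
            }
            return last;      // still inside the interval: keep the old timestamp
        });
        return emit[0];
    }
}
```

A caller would guard the noisy statement with something like
{{if (throttle.shouldLog("zk-disconnect")) LOG.error(...)}}, where
{{throttle}} is {{new LogThrottle(60_000L, System::currentTimeMillis)}}.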
On a different note, for a setup with multiple KMS instances, you pointed out
that the {{LoadBalancingKMSClientProvider}} will try to find a working KMS.
The problem I've seen is that the KMS client timeout seems quite long, so when
one KMS instance has failed, it takes a client a long time to get a response
from KMS. Do you know how we can configure this behavior and have a shorter
timeout?
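
To make the behavior I am asking about concrete, here is a rough sketch in
plain Java of failover with a short per-attempt time limit. It is only an
illustration and not how {{LoadBalancingKMSClientProvider}} is implemented:
{{BoundedFailover}}, {{firstSuccessful}} and the timeout value are
hypothetical, and each "endpoint" is just a {{Callable}} standing in for a
call to one KMS instance.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration of failover with a per-attempt time limit. */
final class BoundedFailover {
    /** Tries each endpoint in order; abandons one after timeoutMs and moves to the next. */
    static <T> T firstSuccessful(List<Callable<T>> endpoints, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            Exception last = null;
            for (Callable<T> endpoint : endpoints) {
                Future<T> attempt = pool.submit(endpoint);
                try {
                    // Wait only timeoutMs for this endpoint instead of a long default.
                    return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    attempt.cancel(true); // interrupt the slow or failed attempt
                    last = e;
                }
            }
            throw new IOException("all endpoints failed", last);
        } finally {
            pool.shutdownNow(); // do not leave abandoned attempts running
        }
    }
}
```

With a limit of a few hundred milliseconds, one unresponsive instance adds
only that much delay before the next instance is tried.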
> ZKDelegationTokenSecretManager curator client seems to rapidly connect &
> disconnect from ZK
> -------------------------------------------------------------------------------------------
>
> Key: HADOOP-13653
> URL: https://issues.apache.org/jira/browse/HADOOP-13653
> Project: Hadoop Common
> Issue Type: Bug
> Components: kms
> Reporter: Alex Ivanov
> Priority: Critical
>
> For extended periods of time, KMS gets stuck in a connect/disconnect loop
> with ZooKeeper. It is not clear what causes the connection to be closed. I
> didn't see any issues on the ZK server side, so the issue must reside on the
> client side.
> *Example errors*
> NOTE: I had to filter the logs heavily since they were many GB in size
> (thanks to curator error logging). What is left is an illustration of the
> delegation token creations, and the Zookeeper sessions getting lost and
> re-established over the course of 2 hours.
> {code}
> 2016-09-25 01:43:04,377 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [75027a21ab399aa7789d6907d70fadc4, 46]
> 2016-09-25 01:43:04,557 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [1106d0754d43dcf29324d7be737f51f0, 46]
> 2016-09-25 01:43:11,846 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [4426092c861f49c6ba0c60b49b9539e5, 46]
> 2016-09-25 01:43:48,974 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [a99efff2705d6489deb059098f18818f, 46]
> 2016-09-25 01:43:49,174 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [398b5962fd647880961ba5e86a77b414, 46]
> 2016-09-25 01:44:03,359 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [413187e62a21b5459422b5c524315d06, 46]
> 2016-09-25 01:44:03,625 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [7cc2c0d82edd40e7e6f6f40af20d04d3, 46]
> 2016-09-25 01:44:06,062 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [bd9394fce20607c12bc00104bea49284, 46]
> 2016-09-25 01:44:07,134 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [7dad3bd10526517e5e1cfccd2e96074a, 46]
> 2016-09-25 01:44:07,230 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [a712ed40687580647d070c9c7f525e15, 46]
> 2016-09-25 01:44:48,481 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [44bfefa31192c68e3cc053eec4e57e14, 46]
> 2016-09-25 01:44:48,522 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [67efc2aa65eeba701ad7d3d7bab51def, 46]
> 2016-09-25 01:44:50,259 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [b43e641f58dfbd2c72550ab6804f37d1, 46]
> 2016-09-25 01:44:54,271 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [ac2fbcf404c633759b75e6d6aae00e05, 46]
> 2016-09-25 01:44:56,141 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [cdbd224079a4a10400d00d0b8eece008, 46]
> 2016-09-25 01:45:01,328 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [e03218f4835524f3d05519d27bb04e35, 46]
> 2016-09-25 01:45:02,728 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [569ae6d666d584b6843fffc47a63d147, 46]
> 2016-09-25 01:45:02,832 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [c9048271483da234c12f75569b9513c6, 46]
> 2016-09-25 01:45:05,536 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [f519d621389e41b63e8d92b4cb15f832, 46]
> 2016-09-25 01:45:07,886 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [45cf6ba58b2bb348ac5e88fa18fe9dad, 46]
> 2016-09-25 01:47:24,346 WARN ConnectionState - Connection attempt
> unsuccessful after 66294 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> 2016-09-25 01:47:25,120 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [f160a865db69ef33548f146c9b3b84c6, 46]
> 2016-09-25 01:47:25,276 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [9d60add471464e01ef691c43bd901d96, 46]
> 2016-09-25 01:47:28,739 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [659ecabf02ff809736202a8484ff2be8, 46]
> 2016-09-25 01:48:33,233 WARN ConnectionState - Connection attempt
> unsuccessful after 64494 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> 2016-09-25 01:48:33,306 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [15b87dd6c2251177d8db9dda415d0e06, 46]
> 2016-09-25 01:48:33,459 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [306aa796017aab2b559bf503f81175e0, 46]
> 2016-09-25 01:48:34,665 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [6c47cc10edcf0931e7e26665f99dadc5, 46]
> 2016-09-25 01:49:40,669 WARN ConnectionState - Connection attempt
> unsuccessful after 66006 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> 2016-09-25 01:49:40,847 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [049585962d9ac1cdb4df17f826891130, 46]
> 2016-09-25 01:49:41,523 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [a0fd8cffe2ba40d63d4cd009aabb77bb, 46]
> 2016-09-25 01:49:44,811 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [ea94d6cc4df6621a38be9f79121b5cc6, 46]
> 2016-09-25 01:50:48,781 WARN ConnectionState - Connection attempt
> unsuccessful after 64005 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> ...
> ...
> 2016-09-25 03:39:51,312 WARN ConnectionState - Connection attempt
> unsuccessful after 60001 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> 2016-09-25 03:39:51,752 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [2b605504535c741de3d18ed30fe90d7c, 46]
> 2016-09-25 03:39:55,345 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [9dda97d42df3bf8ae805f1da2857bc33, 46]
> 2016-09-25 03:39:55,668 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [f52dcf38a4565a904be93fe3de92825d, 46]
> 2016-09-25 03:41:02,660 WARN ConnectionState - Connection attempt
> unsuccessful after 67005 (greater than max timeout of 60000). Resetting
> connection and trying again with a new connection.
> 2016-09-25 03:41:02,921 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [770057f6b1412f096a92ce6599c6422a, 46]
> 2016-09-25 03:41:03,035 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [f2b366612a15e0a450eac8b4cf556515, 46]
> 2016-09-25 03:41:05,903 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [a3b2d0684e1373fc2f9e5c8bff5e9939, 46]
> 2016-09-25 03:41:05,994 INFO AbstractDelegationTokenSecretManager - Creating
> password for identifier: [f9e72fd48c8aab0a737d2cc8319aa389, 46]
> {code}
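
For anyone who needs to reduce logs like the excerpt above, here is a small
sketch in plain Java that counts the {{ConnectionState}} reset warnings and
reports the largest elapsed value. It assumes one log entry per line (the
entries above are wrapped by the mail client), and the class and method names
are hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for summarizing connection-reset warnings in a KMS log. */
final class ReconnectSummary {
    // Matches the warning text seen in the excerpt and captures the elapsed value.
    private static final Pattern RESET = Pattern.compile(
        "WARN\\s+ConnectionState - Connection attempt unsuccessful after (\\d+)");

    /** Returns {number of reset warnings, largest elapsed value} for the given lines. */
    static long[] summarize(List<String> lines) {
        long count = 0;
        long maxElapsed = 0;
        for (String line : lines) {
            Matcher m = RESET.matcher(line);
            if (m.find()) {
                count++;
                maxElapsed = Math.max(maxElapsed, Long.parseLong(m.group(1)));
            }
        }
        return new long[] {count, maxElapsed};
    }
}
```

The lines could come from {{java.nio.file.Files.readAllLines}} for a small
file; for logs that are many GB in size, streaming them line by line would be
the safer choice.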