[ 
https://issues.apache.org/jira/browse/HADOOP-13653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528459#comment-15528459
 ] 

Alex Ivanov commented on HADOOP-13653:
--------------------------------------

[~xiaochen], thank you for your comments. The ZooKeeper (ZK) cluster seemed 
healthy, and nothing in any of the ZooKeeper logs indicated loss of quorum or 
random disconnects.
Instead, it seems the ZK connections became unstable once a large number of 
KMS delegation tokens (>160,000) had accumulated. I'm not sure how this caused 
the issue, but once we manually deleted the tokens, the disconnects stopped. 
Once we apply the patch you provided for 
[HADOOP-13487|https://issues.apache.org/jira/browse/HADOOP-13487] (thank you!), 
I expect we'll be able to better manage the number of delegation tokens in ZK.

I do wish we could control some of the Curator parameters, so that we could 
adjust the timeouts to our needs and curtail the repetitive error logging when 
a disconnect happens. These logs have taken up to 70GB of space per day, which 
turns viewing a single log into a big data problem.
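
As a stopgap on the log-reading side, a streaming filter could collapse runs of 
identical messages before anyone opens the file. The following is only a 
minimal sketch: the timestamp pattern is an assumption based on the log excerpt 
quoted in the issue description, and {{collapse_repeats}} is a hypothetical 
helper, not part of Hadoop or Curator.

```python
import re
import sys
from typing import Iterable, Iterator

# Assumed line shape, taken from the excerpt in the issue description:
# "2016-09-25 01:47:24,346 WARN  ConnectionState - <message>"
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+")


def _emit(first_line: str, count: int) -> str:
    """Format one run: the first line, plus a count if it repeated."""
    return first_line if count == 1 else f"{first_line} [repeated {count}x]"


def collapse_repeats(lines: Iterable[str]) -> Iterator[str]:
    """Yield each run of consecutive lines that differ only by timestamp once."""
    prev_key = None
    first_line = None
    count = 0
    for raw in lines:
        line = raw.rstrip("\n")
        key = _TIMESTAMP.sub("", line)  # compare lines with the timestamp removed
        if key == prev_key:
            count += 1
            continue
        if first_line is not None:
            yield _emit(first_line, count)
        prev_key, first_line, count = key, line, 1
    if first_line is not None:
        yield _emit(first_line, count)


if __name__ == "__main__":
    # Reads the log from stdin line by line, so memory use stays flat.
    for out in collapse_repeats(sys.stdin):
        print(out)
```

Only consecutive duplicates are merged, so messages that embed a varying value 
(such as the elapsed milliseconds in the {{ConnectionState}} warnings) would 
still appear individually.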

On a different note, for a setup with multiple KMS instances, you pointed out 
that the {{LoadBalancingKMSClientProvider}} will try to find a working KMS. 
The problem I've seen is that the KMS client timeout seems quite long, so when 
one KMS has failed, it takes a long time for a client to reach a working KMS. 
Do you know how we can configure this behavior and set a shorter timeout?

> ZKDelegationTokenSecretManager curator client seems to rapidly connect & 
> disconnect from ZK
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13653
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13653
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: kms
>            Reporter: Alex Ivanov
>            Priority: Critical
>
> At times, KMS gets into a connect/disconnect loop with ZooKeeper. 
> It is not clear what causes the connection to be closed. I didn't see any 
> issues on the ZK server side, so the issue must reside on the client side.
> *Example errors*
> NOTE: I had to filter the logs heavily since they were many GB in size 
> (thanks to Curator error logging). What is left illustrates the delegation 
> token creations and the ZooKeeper sessions getting lost and re-established 
> over the course of 2 hours.
> {code}
> 2016-09-25 01:43:04,377 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [75027a21ab399aa7789d6907d70fadc4, 46]
> 2016-09-25 01:43:04,557 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [1106d0754d43dcf29324d7be737f51f0, 46]
> 2016-09-25 01:43:11,846 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [4426092c861f49c6ba0c60b49b9539e5, 46]
> 2016-09-25 01:43:48,974 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [a99efff2705d6489deb059098f18818f, 46]
> 2016-09-25 01:43:49,174 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [398b5962fd647880961ba5e86a77b414, 46]
> 2016-09-25 01:44:03,359 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [413187e62a21b5459422b5c524315d06, 46]
> 2016-09-25 01:44:03,625 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [7cc2c0d82edd40e7e6f6f40af20d04d3, 46]
> 2016-09-25 01:44:06,062 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [bd9394fce20607c12bc00104bea49284, 46]
> 2016-09-25 01:44:07,134 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [7dad3bd10526517e5e1cfccd2e96074a, 46]
> 2016-09-25 01:44:07,230 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [a712ed40687580647d070c9c7f525e15, 46]
> 2016-09-25 01:44:48,481 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [44bfefa31192c68e3cc053eec4e57e14, 46]
> 2016-09-25 01:44:48,522 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [67efc2aa65eeba701ad7d3d7bab51def, 46]
> 2016-09-25 01:44:50,259 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [b43e641f58dfbd2c72550ab6804f37d1, 46]
> 2016-09-25 01:44:54,271 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [ac2fbcf404c633759b75e6d6aae00e05, 46]
> 2016-09-25 01:44:56,141 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [cdbd224079a4a10400d00d0b8eece008, 46]
> 2016-09-25 01:45:01,328 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [e03218f4835524f3d05519d27bb04e35, 46]
> 2016-09-25 01:45:02,728 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [569ae6d666d584b6843fffc47a63d147, 46]
> 2016-09-25 01:45:02,832 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [c9048271483da234c12f75569b9513c6, 46]
> 2016-09-25 01:45:05,536 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [f519d621389e41b63e8d92b4cb15f832, 46]
> 2016-09-25 01:45:07,886 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [45cf6ba58b2bb348ac5e88fa18fe9dad, 46]
> 2016-09-25 01:47:24,346 WARN  ConnectionState - Connection attempt 
> unsuccessful after 66294 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection.
> 2016-09-25 01:47:25,120 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [f160a865db69ef33548f146c9b3b84c6, 46]
> 2016-09-25 01:47:25,276 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [9d60add471464e01ef691c43bd901d96, 46]
> 2016-09-25 01:47:28,739 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [659ecabf02ff809736202a8484ff2be8, 46]
> 2016-09-25 01:48:33,233 WARN  ConnectionState - Connection attempt 
> unsuccessful after 64494 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection.
> 2016-09-25 01:48:33,306 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [15b87dd6c2251177d8db9dda415d0e06, 46]
> 2016-09-25 01:48:33,459 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [306aa796017aab2b559bf503f81175e0, 46]
> 2016-09-25 01:48:34,665 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [6c47cc10edcf0931e7e26665f99dadc5, 46]
> 2016-09-25 01:49:40,669 WARN  ConnectionState - Connection attempt 
> unsuccessful after 66006 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection.
> 2016-09-25 01:49:40,847 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [049585962d9ac1cdb4df17f826891130, 46]
> 2016-09-25 01:49:41,523 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [a0fd8cffe2ba40d63d4cd009aabb77bb, 46]
> 2016-09-25 01:49:44,811 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [ea94d6cc4df6621a38be9f79121b5cc6, 46]
> 2016-09-25 01:50:48,781 WARN  ConnectionState - Connection attempt 
> unsuccessful after 64005 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection.
> ...
> ...
> 2016-09-25 03:39:51,312 WARN  ConnectionState - Connection attempt 
> unsuccessful after 60001 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection.
> 2016-09-25 03:39:51,752 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [2b605504535c741de3d18ed30fe90d7c, 46]
> 2016-09-25 03:39:55,345 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [9dda97d42df3bf8ae805f1da2857bc33, 46]
> 2016-09-25 03:39:55,668 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [f52dcf38a4565a904be93fe3de92825d, 46]
> 2016-09-25 03:41:02,660 WARN  ConnectionState - Connection attempt 
> unsuccessful after 67005 (greater than max timeout of 60000). Resetting 
> connection and trying again with a new connection.
> 2016-09-25 03:41:02,921 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [770057f6b1412f096a92ce6599c6422a, 46]
> 2016-09-25 03:41:03,035 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [f2b366612a15e0a450eac8b4cf556515, 46]
> 2016-09-25 03:41:05,903 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [a3b2d0684e1373fc2f9e5c8bff5e9939, 46]
> 2016-09-25 03:41:05,994 INFO  AbstractDelegationTokenSecretManager - Creating 
> password for identifier: [f9e72fd48c8aab0a737d2cc8319aa389, 46]
> {code}
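
The spacing of the {{ConnectionState}} warnings in the excerpt above can be 
measured rather than eyeballed. This is a rough sketch only: it assumes the 
mail quoting ("> ") and line wrapping have already been undone so that each 
log entry sits on one line, and {{reset_intervals}} is a hypothetical helper 
name.

```python
import re
from datetime import datetime
from typing import Iterable, List

# Matches the start of a connection-reset warning as it appears in the excerpt.
_RESET = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN\s+"
    r"ConnectionState - Connection attempt"
)


def reset_intervals(lines: Iterable[str]) -> List[float]:
    """Return the seconds elapsed between consecutive reset warnings."""
    times = []
    for line in lines:
        match = _RESET.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
```

Fed the first four warnings from the excerpt (01:47:24,346 through 
01:50:48,781), this yields gaps of roughly 67 to 69 seconds, slightly above 
the 60000 ms max timeout reported in each warning.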


