[ https://issues.apache.org/jira/browse/HBASE-21796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774622#comment-16774622 ]
Ankit Singhal edited comment on HBASE-21796 at 2/22/19 12:33 AM:
-----------------------------------------------------------------
{quote}need to look more closely at when AUTH_FAILED would be thrown by the ZK server.{quote}
Probably in our case, AUTH_FAILED is thrown when ZK was not able to communicate with the KDC:
{noformat}
client.ZooKeeperSaslClient: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Connection refused (Connection refused))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
{noformat}
{quote}.003 adds some extra configuration properties ({{hbase.zookeeper.authfailed.retries.number}} default: 15, {{hbase.zookeeper.authfailed.pause}} default: 100){quote}
Can we use RetryCounter instead of:
{noformat}
+      observedAuthFailed++;
+      if (observedAuthFailed > allowedAuthFailedRetries) {
+        throw new RuntimeException("Exceeded the configured retries for handling ZooKeeper"
+            + " AUTH_FAILED exceptions (" + allowedAuthFailedRetries + ")");
+      }
+      // Avoid a fast retry loop.
+      if (LOG.isTraceEnabled()) {
+        LOG.trace("Sleeping " + authFailedPause + "ms before re-creating ZooKeeper object after"
+            + " AUTH_FAILED state");
+      }
+      TimeUnit.MILLISECONDS.sleep(authFailedPause);
{noformat}
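For illustration, here is a minimal, self-contained sketch of what a RetryCounter-style loop could look like. The {{SimpleRetryCounter}} class below is a hypothetical stand-in written for this example, not HBase's real {{org.apache.hadoop.hbase.util.RetryCounter}}; it only mimics the {{shouldRetry()}}/{{sleepUntilNextRetry()}} shape of that API. {{retryWhileAuthFailed}} and its parameters are likewise invented for the sketch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: SimpleRetryCounter is a stand-in for HBase's
// org.apache.hadoop.hbase.util.RetryCounter, not the real class.
public class AuthFailedRetry {

  static final class SimpleRetryCounter {
    private final int maxAttempts;
    private final long pauseMillis;
    private int attempts = 0;

    SimpleRetryCounter(int maxAttempts, long pauseMillis) {
      this.maxAttempts = maxAttempts;
      this.pauseMillis = pauseMillis;
    }

    // Mimics RetryCounter.shouldRetry(): true while attempts remain.
    boolean shouldRetry() {
      return attempts < maxAttempts;
    }

    // Mimics RetryCounter.sleepUntilNextRetry(): record the attempt, then pause.
    void sleepUntilNextRetry() throws InterruptedException {
      attempts++;
      Thread.sleep(pauseMillis);
    }

    int getAttemptTimes() {
      return attempts;
    }

    int getMaxAttempts() {
      return maxAttempts;
    }
  }

  // Retry loop shaped like the patch snippet above: bail out once the
  // configured number of AUTH_FAILED retries is exhausted, otherwise pause
  // before the caller re-creates the ZooKeeper object. Returns the number
  // of retries that were needed.
  static int retryWhileAuthFailed(SimpleRetryCounter counter, BooleanSupplier inAuthFailed)
      throws InterruptedException {
    while (inAuthFailed.getAsBoolean()) {
      if (!counter.shouldRetry()) {
        throw new RuntimeException("Exceeded the configured retries for handling ZooKeeper"
            + " AUTH_FAILED exceptions (" + counter.getMaxAttempts() + ")");
      }
      counter.sleepUntilNextRetry(); // avoid a fast retry loop
    }
    return counter.getAttemptTimes();
  }
}
```

With the real class, the counter would presumably come from a factory configured with the two new properties (something like {{new RetryCounterFactory(15, 100).create()}}), which would fold the bookkeeping, the bounds check, and the pause into the existing utility instead of a hand-rolled counter.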
> RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21796
>                 URL: https://issues.apache.org/jira/browse/HBASE-21796
>             Project: HBase
>          Issue Type: Bug
>          Components: Zookeeper
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 1.5.1
>
>         Attachments: HBASE-21796.001.branch-1.patch, HBASE-21796.002.branch-1.patch, HBASE-21796.003.branch-1.patch
>
>
> We've observed the following situation inside of a RegionServer which leaves an HConnection in a broken state, as a result of the ZooKeeper client having received an AUTH_FAILED event in the Phoenix secondary-indexing code path. The result was that the HConnection used to write the secondary index updates failed every time the client re-attempted the write, but we had no outward signs from the HConnection that there was a problem with that HConnection instance.
> ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to the {{AUTH_FAILED}} state, we must open a new ZooKeeper instance: [https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]
> When a new HConnection (or one without a cached meta location) tries to access ZooKeeper to find meta's location or the cluster ID, it spins indefinitely because we can never access ZooKeeper: our client is broken from the AUTH_FAILED state. For the Phoenix use case (where we're trying to use this HConnection within the RS), this breaks things pretty fast.
> The circumstances that caused us to observe this are not an HBase (or Phoenix or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of networking issues on a user's system. Despite this, we can make our handling of this situation better.
> We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper object when we need one (e.g. session expired/closed). We can extend this same logic to also re-create the ZK client object if we observe an AUTH_FAILED state.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)