[ https://issues.apache.org/jira/browse/HBASE-21796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774622#comment-16774622 ]
Ankit Singhal edited comment on HBASE-21796 at 2/22/19 12:33 AM:
-----------------------------------------------------------------
{quote}need to look more closely at when AUTH_FAILED would be thrown by the ZK server.{quote}
Probably in our case, AUTH_FAILED is thrown when ZK was not able to communicate with the KDC:
{noformat}
client.ZooKeeperSaslClient: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Connection refused (Connection refused))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
{noformat}
{quote}.003 adds some extra configuration properties ({{hbase.zookeeper.authfailed.retries.number}} default: 15, {{hbase.zookeeper.authfailed.pause}} default: 100){quote}
Can we use RetryCounter instead of:
{noformat}
+      observedAuthFailed++;
+      if (observedAuthFailed > allowedAuthFailedRetries) {
+        throw new RuntimeException("Exceeded the configured retries for handling ZooKeeper"
+            + " AUTH_FAILED exceptions (" + allowedAuthFailedRetries + ")");
+      }
+      // Avoid a fast retry loop.
+      if (LOG.isTraceEnabled()) {
+        LOG.trace("Sleeping " + authFailedPause + "ms before re-creating ZooKeeper object after"
+            + " AUTH_FAILED state");
+      }
+      TimeUnit.MILLISECONDS.sleep(authFailedPause);
{noformat}
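For illustration, here is a minimal, self-contained sketch of what a RetryCounter-style loop could look like. The {{SimpleRetryCounter}} class below is a hypothetical stand-in written for this example, not HBase's real {{org.apache.hadoop.hbase.util.RetryCounter}}; it only mimics the {{shouldRetry()}}/{{sleepUntilNextRetry()}} shape of that API. {{retryWhileAuthFailed}} and its parameters are likewise invented for the sketch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: SimpleRetryCounter is a stand-in for HBase's
// org.apache.hadoop.hbase.util.RetryCounter, not the real class.
public class AuthFailedRetry {

  static final class SimpleRetryCounter {
    private final int maxAttempts;
    private final long pauseMillis;
    private int attempts = 0;

    SimpleRetryCounter(int maxAttempts, long pauseMillis) {
      this.maxAttempts = maxAttempts;
      this.pauseMillis = pauseMillis;
    }

    // Mimics RetryCounter.shouldRetry(): true while attempts remain.
    boolean shouldRetry() {
      return attempts < maxAttempts;
    }

    // Mimics RetryCounter.sleepUntilNextRetry(): record the attempt, then pause.
    void sleepUntilNextRetry() throws InterruptedException {
      attempts++;
      Thread.sleep(pauseMillis);
    }

    int getAttemptTimes() {
      return attempts;
    }

    int getMaxAttempts() {
      return maxAttempts;
    }
  }

  // Retry loop shaped like the patch snippet above: bail out once the
  // configured number of AUTH_FAILED retries is exhausted, otherwise pause
  // before the caller re-creates the ZooKeeper object. Returns the number
  // of retries that were needed.
  static int retryWhileAuthFailed(SimpleRetryCounter counter, BooleanSupplier inAuthFailed)
      throws InterruptedException {
    while (inAuthFailed.getAsBoolean()) {
      if (!counter.shouldRetry()) {
        throw new RuntimeException("Exceeded the configured retries for handling ZooKeeper"
            + " AUTH_FAILED exceptions (" + counter.getMaxAttempts() + ")");
      }
      counter.sleepUntilNextRetry(); // avoid a fast retry loop
    }
    return counter.getAttemptTimes();
  }
}
```

With the real class, the counter would presumably come from a factory configured with the two new properties (something like {{new RetryCounterFactory(15, 100).create()}}), which would fold the bookkeeping, the bounds check, and the pause into the existing utility instead of a hand-rolled counter.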
> RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21796
>                 URL: https://issues.apache.org/jira/browse/HBASE-21796
>             Project: HBase
>          Issue Type: Bug
>          Components: Zookeeper
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 1.5.1
>
>         Attachments: HBASE-21796.001.branch-1.patch, HBASE-21796.002.branch-1.patch, HBASE-21796.003.branch-1.patch
>
>
> We've observed the following situation inside of a RegionServer which leaves an HConnection in a broken state, as a result of the ZooKeeper client having received an AUTH_FAILED event in the Phoenix secondary-indexing code path. The result was that the HConnection used to write the secondary index updates failed every time the client re-attempted the write, but we had no outward signs from the HConnection that there was a problem with that HConnection instance.
> ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to the {{AUTH_FAILED}} state, we must open a new ZooKeeper instance: [https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]
> When a new HConnection (or one without a cached meta location) tries to access ZooKeeper to find meta's location or the cluster ID, it spins indefinitely because we can never access ZooKeeper: our client is broken from the AUTH_FAILED state. For the Phoenix use case (where we're trying to use this HConnection within the RS), this breaks things pretty fast.
> The circumstances that caused us to observe this are not an HBase (or Phoenix or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of networking issues on a user's system. Despite this, we can make our handling of this situation better.
> We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper object when we need one (e.g. session expired/closed). We can extend this same logic to also re-create the ZK client object if we observe an AUTH_FAILED state.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)