[jira] [Commented] (HBASE-21796) RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED

Josh Elser (JIRA) Thu, 31 Jan 2019 08:50:41 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757470#comment-16757470
 ]


Josh Elser commented on HBASE-21796:
------------------------------------

{quote}The only question remaining is whether AUTH_FAILED is a non-recoverable 
event that we should immediately throw back, or is it (at least in some cases) 
recoverable.
{quote}
Thanks, Enis. I think it is recoverable in some cases, but I'm not sure in 
what percentage of cases. The situation in which we saw this issue arise is 
one specific user's problem rather than something we want to assume is 
"normal" for the average user.

I'd think that bounding the number of times we try to recover from an 
AUTH_FAILED would help make sure we still eventually fail (in cases where 
we can't recover). Maybe that's a middle ground?
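
As a rough sketch of that middle ground (this is not the attached patch): 
re-create the client when it is in AUTH_FAILED, but give up after a fixed 
number of consecutive attempts. ZkClient, StubZkClient, 
BoundedAuthFailedRecovery and maxAuthFailedRetries are hypothetical names made 
up for illustration. With the real client, the check would presumably be 
something like ZooKeeper.getState() == ZooKeeper.States.AUTH_FAILED.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a ZooKeeper client; NOT the real ZooKeeper API. */
interface ZkClient {
    boolean isAuthFailed();
    byte[] getData(String path);
}

/** Trivial stub so the sketch runs without a ZooKeeper ensemble. */
final class StubZkClient implements ZkClient {
    private final boolean authFailed;

    StubZkClient(boolean authFailed) {
        this.authFailed = authFailed;
    }

    @Override
    public boolean isAuthFailed() {
        return authFailed;
    }

    @Override
    public byte[] getData(String path) {
        return path.getBytes();
    }
}

/** Re-creates a client stuck in AUTH_FAILED, but only a bounded number of times. */
final class BoundedAuthFailedRecovery {
    private final Supplier<ZkClient> factory;
    private final int maxAuthFailedRetries; // assumed to be a configurable limit
    private ZkClient client;
    private int authFailedCount = 0;

    BoundedAuthFailedRecovery(Supplier<ZkClient> factory, int maxAuthFailedRetries) {
        this.factory = factory;
        this.maxAuthFailedRetries = maxAuthFailedRetries;
        this.client = factory.get();
    }

    byte[] getData(String path) {
        while (true) {
            if (!client.isAuthFailed()) {
                // Healthy client: reset the counter of consecutive failures.
                authFailedCount = 0;
                return client.getData(path);
            }
            if (authFailedCount >= maxAuthFailedRetries) {
                // Unrecoverable case: fail instead of spinning forever.
                throw new IllegalStateException(
                    "Still AUTH_FAILED after " + authFailedCount + " re-creations");
            }
            // A client in AUTH_FAILED cannot be reused, so open a new one.
            authFailedCount++;
            client = factory.get();
        }
    }
}
```

Counting consecutive failures (and resetting on success) means a transient 
problem doesn't permanently use up the budget, while a client that can never 
authenticate still surfaces an error to the caller.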

> RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21796
>                 URL: https://issues.apache.org/jira/browse/HBASE-21796
>             Project: HBase
>          Issue Type: Bug
>          Components: Zookeeper
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 1.5.0
>
>         Attachments: HBASE-21796.001.branch-1.patch, 
> HBASE-21796.002.branch-1.patch
>
>
> We've observed the following situation inside of a RegionServer: an 
> HConnection is left in a broken state after its ZooKeeper client receives 
> an AUTH_FAILED in the Phoenix secondary indexing code-path. The HConnection 
> used to write the secondary index updates failed every time the client 
> re-attempted the write, but the HConnection gave no outward sign that 
> anything was wrong with that instance.
> ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to the 
> {{AUTH_FAILED}} state that we must open a new ZooKeeper instance: 
> [https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]
> When a new HConnection (or one without a cached meta location) tries to 
> access ZooKeeper to find meta's location or the cluster ID, it spins 
> indefinitely: we can never access ZooKeeper because our client is broken by 
> the AUTH_FAILED state. For the Phoenix use-case (where we're trying to use 
> this HConnection within the RS), this breaks things pretty fast.
> The circumstances that caused us to observe this are not an HBase (or Phoenix 
> or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of 
> networking issues on a user's system. Despite this, we can make our handling 
> of this situation better.
> We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper 
> object when we need one (e.g. session expired/closed). We can extend this 
> same logic to also re-create the ZK client object if we observe an 
> AUTH_FAILED state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
