[ https://issues.apache.org/jira/browse/HBASE-21796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756355#comment-16756355 ]

Josh Elser commented on HBASE-21796:
------------------------------------
{quote}why are we trying to reconnect on AUTH_FAILED? Re-create a ZooKeeper
object when session expired/closed make sense, but in case of AUTH_FAIL it will
always fail as credential will be wrong
{quote}
The core issue is that AUTH_FAILED is not always a deterministic failure. In
the situation we observed, the ZK quorum returned the failure only
intermittently. You are right that when the credentials really are
invalid/bad, re-creating the client wouldn't make sense as a fix.

I'm not sure how to reconcile these two cases (or how to differentiate
between them), which makes this a tricky issue.

> RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
> -----------------------------------------------------------------------
>
> Key: HBASE-21796
> URL: https://issues.apache.org/jira/browse/HBASE-21796
> Project: HBase
> Issue Type: Bug
> Components: Zookeeper
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-21796.001.branch-1.patch,
> HBASE-21796.002.branch-1.patch
>
>
> We've observed a situation inside of a RegionServer, in the Phoenix
> secondary indexing code-path, in which the ZooKeeper client received
> AUTH_FAILED and left an HConnection in a broken state. The HConnection used
> to write the secondary index updates then failed every time the client
> re-attempted the write, but we had no outward sign from the HConnection
> that there was a problem with that instance.
> The ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to
> the {{AUTH_FAILED}} state, we must open a new ZooKeeper instance:
> [https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]
> When a new HConnection (or one without a cached meta location) tries to
> access ZooKeeper to find meta's location or the cluster ID, it spins
> indefinitely: we can never reach ZooKeeper because our client is
> broken by the AUTH_FAILED. For the Phoenix use-case (where we're trying to
> use this HConnection within the RS), this breaks things pretty fast.
> The circumstances that caused us to observe this are not an HBase (or Phoenix
> or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of
> networking issues on a user's system. Despite this, we can make our handling
> of this situation better.
> We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper
> object when we need one (e.g. session expired/closed). We can extend this
> same logic to also re-create the ZK client object if we observe an
> AUTH_FAILED state.
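
The re-creation logic described above can be sketched roughly as follows.
This is a minimal, dependency-free illustration and not the attached patch:
Client, State, RecoverableClient, checkClient and the
maxAuthFailedRecreations limit are hypothetical stand-ins, not the real
ZooKeeper or RecoverableZooKeeper APIs. The bound on consecutive
re-creations is one assumed way to keep a client with genuinely bad
credentials from retrying forever.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.function.Supplier;

class RecreateOnAuthFailedSketch {
    /** Hypothetical subset of client states relevant to this issue. */
    enum State { CONNECTED, CLOSED, AUTH_FAILED }

    /** Hypothetical stand-in for a ZooKeeper handle (not the real API). */
    interface Client {
        State getState();
        void close();
    }

    /** Holds one client and replaces it when the current one is unusable. */
    static final class RecoverableClient {
        private final Supplier<Client> factory;
        private final int maxAuthFailedRecreations; // assumed safety limit
        private Client client;
        private int authFailedRecreations;

        RecoverableClient(Supplier<Client> factory, int maxAuthFailedRecreations) {
            this.factory = factory;
            this.maxAuthFailedRecreations = maxAuthFailedRecreations;
        }

        /** Returns a usable handle, re-creating it if closed or AUTH_FAILED. */
        synchronized Client checkClient() {
            if (client == null || client.getState() == State.CLOSED) {
                client = factory.get();
            } else if (client.getState() == State.AUTH_FAILED) {
                // An AUTH_FAILED handle never recovers; it must be replaced.
                if (authFailedRecreations >= maxAuthFailedRecreations) {
                    throw new IllegalStateException(
                        "AUTH_FAILED persisted after " + authFailedRecreations
                            + " re-creations; credentials are likely invalid");
                }
                authFailedRecreations++;
                client.close();
                client = factory.get();
            } else {
                // Healthy handle: the earlier failure was transient.
                authFailedRecreations = 0;
            }
            return client;
        }
    }

    /** Demo helper: a client pinned to a single state. */
    static Client fixedState(State state) {
        return new Client() {
            @Override public State getState() { return state; }
            @Override public void close() { }
        };
    }

    /** Demo helper: hands out clients in the given states, repeating the last. */
    static Supplier<Client> scripted(State... states) {
        Deque<State> remaining = new ArrayDeque<>(Arrays.asList(states));
        return () -> fixedState(remaining.size() > 1 ? remaining.poll() : remaining.peek());
    }

    public static void main(String[] args) {
        RecoverableClient rc =
            new RecoverableClient(scripted(State.AUTH_FAILED, State.CONNECTED), 3);
        rc.checkClient();            // first handle is created in AUTH_FAILED
        Client c = rc.checkClient(); // AUTH_FAILED is seen, handle is replaced
        System.out.println(c.getState());
    }
}
```

In this sketch a transient AUTH_FAILED is healed by the next checkClient()
call, while a persistent one surfaces as an exception once the limit is
reached instead of spinning indefinitely.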