[ https://issues.apache.org/jira/browse/HBASE-21796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756523#comment-16756523 ]

Josh Elser commented on HBASE-21796:
------------------------------------

{quote}You are right that in the case that there are invalid/bad credentials, 
this wouldn't make sense as a fix.
{quote}
In the case of invalid credentials being provided (e.g. a keytab that doesn't 
exist, or the wrong principal for the keytab), the client fails with a NO_AUTH 
error on the individual operation, not an AUTH_FAILED connection state.
{noformat}
2019-01-30 14:58:15,091 INFO  [main-SendThread(hw13390.local:2181)] zookeeper.ClientCnxn: Opening socket connection to server hw13390.local/fe80:0:0:0:18be:934e:c223:9b8%4:2181. Will not attempt to authenticate using SASL (unknown error)
...
2019-01-30 14:58:15,120 WARN  [main] zookeeper.ZKUtil: regionserver:16202-0x1003009701e0028, quorum=hw13390.local:2181, baseZNode=/hbase-1.5-secure Unable to get data of znode /hbase-1.5-secure/running
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /hbase-1.5-secure/running
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1212)
  at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:396)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:713)
  at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:689)
  at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:78)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:636)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2864)
  at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
  at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
  at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)
  at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2881)
{noformat}
I need to look more closely at when AUTH_FAILED would be thrown by the ZK 
server. I see one Jira issue that talks about it for Kerberos ticket renewals.
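
For reference, a minimal sketch of how the two failure modes surface in the 
plain ZooKeeper client API (the quorum address and znode path are just the 
ones from the trace above): a NO_AUTH failure arrives as a KeeperException on 
the individual operation, while AUTH_FAILED arrives as a connection-state 
event on the Watcher and as the session state on the handle.
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class AuthStateCheck {
  public static void main(String[] args) throws Exception {
    // AuthFailed is reported as a connection event, not as an exception
    // thrown by an individual ZooKeeper operation.
    ZooKeeper zk = new ZooKeeper("hw13390.local:2181", 30000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.AuthFailed) {
          System.out.println("Client entered AUTH_FAILED; must be replaced");
        }
      }
    });
    try {
      zk.getData("/hbase-1.5-secure/running", false, null);
    } catch (KeeperException.NoAuthException e) {
      // Bad/missing credentials against an ACL'd znode surface here as
      // NoAuth, matching the stack trace above; the session is still usable.
      System.out.println("NoAuth on znode: " + e.getPath());
    }
    // A stuck client can also be detected by polling the handle directly.
    System.out.println("Session state: " + zk.getState());
    zk.close();
  }
}
{code}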

> RecoverableZooKeeper indefinitely retries a client stuck in AUTH_FAILED
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21796
>                 URL: https://issues.apache.org/jira/browse/HBASE-21796
>             Project: HBase
>          Issue Type: Bug
>          Components: Zookeeper
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 1.5.0
>
>         Attachments: HBASE-21796.001.branch-1.patch, 
> HBASE-21796.002.branch-1.patch
>
>
> We've observed the following situation inside of a RegionServer which leaves 
> an HConnection in a broken state, as a result of the ZooKeeper client having 
> entered the AUTH_FAILED state in the Phoenix secondary indexing code-path. The 
> result was that the HConnection used to write the secondary index updates 
> failed every time the client re-attempted the write, but there was no outward 
> sign from the HConnection that the instance itself was broken.
> ZooKeeper programmer docs tell us that if a ZooKeeper instance goes to the 
> {{AUTH_FAILED}} state that we must open a new ZooKeeper instance: 
> [https://zookeeper.apache.org/doc/r3.4.13/zookeeperProgrammers.html#ch_zkSessions]
> When a new HConnection (or one without a cached meta location) tries to 
> access ZooKeeper to find meta's location or the cluster ID, it spins 
> indefinitely: we can never reach ZooKeeper because our client is stuck in 
> AUTH_FAILED. For the Phoenix use-case (where we're trying to use this 
> HConnection within the RS), this breaks things pretty fast.
> The circumstances that caused us to observe this are not an HBase (or Phoenix 
> or ZooKeeper) problem. The AUTH_FAILED exception we see is a result of 
> networking issues on a user's system. Despite this, we can make our handling 
> of this situation better.
> We already have logic inside of RecoverableZooKeeper to re-create a ZooKeeper 
> object when we need one (e.g. session expired/closed). We can extend this 
> same logic to also re-create the ZK client object if we observe an 
> AUTH_FAILED state; a sketch of that check follows below.
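
To make that concrete, here is a minimal sketch of the recreation check. It is 
not the attached patch: the class and method names are illustrative, and it 
omits the retry/backoff bookkeeping that RecoverableZooKeeper actually carries.
{code:java}
import java.io.IOException;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical helper illustrating the proposed fix; the real change lives
// inside RecoverableZooKeeper and is shaped by its existing retry logic.
public class ReconnectingZk {
  private final String quorum;
  private final int sessionTimeout;
  private final Watcher watcher;
  private ZooKeeper zk;

  public ReconnectingZk(String quorum, int sessionTimeout, Watcher watcher)
      throws IOException {
    this.quorum = quorum;
    this.sessionTimeout = sessionTimeout;
    this.watcher = watcher;
    this.zk = new ZooKeeper(quorum, sessionTimeout, watcher);
  }

  /**
   * Returns a usable handle, replacing the underlying ZooKeeper instance
   * when it is closed or, per this issue, stuck in AUTH_FAILED. The ZK docs
   * are explicit that an AUTH_FAILED handle can never recover, so the only
   * remedy is to discard it and construct a new one.
   */
  public synchronized ZooKeeper checkZk() throws IOException {
    ZooKeeper.States state = zk.getState();
    if (state == ZooKeeper.States.AUTH_FAILED
        || state == ZooKeeper.States.CLOSED) {
      try {
        zk.close();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      zk = new ZooKeeper(quorum, sessionTimeout, watcher);
    }
    return zk;
  }
}
{code}
Every caller would go through checkZk() before issuing an operation, so a 
client that trips AUTH_FAILED is swapped out on the next retry instead of 
failing forever.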


