Rushabh Shah created HBASE-27957:
------------------------------------
Summary: HConnection (and ZooKeeperWatcher threads) leak in case of
AUTH_FAILED exception.
Key: HBASE-27957
URL: https://issues.apache.org/jira/browse/HBASE-27957
Project: HBase
Issue Type: Bug
Affects Versions: 2.4.17, 1.7.2
Reporter: Rushabh Shah
Observed this in a production environment running some version of the 1.7 release.
The application did not have the right keytab set up for authentication. It
was trying to create an HConnection, and the ZooKeeper server threw an
AUTH_FAILED exception.
After the application had been in this state for a few hours, we saw thousands
of zk-event-processor threads with the stack trace below.
{noformat}
"zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms elapsed=41794.58s tid=0x00007fd7805066d0 nid=0x1245 waiting on condition [0x00007fd75df01000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for <0x00007fd9874a85e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:2081)
    at java.util.concurrent.LinkedBlockingQueue.take([email protected]/LinkedBlockingQueue.java:433)
    at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1054)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1114)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
{noformat}
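One way to confirm this kind of leak from inside the affected application, without taking a full thread dump, is to count live threads by name prefix. The following is only a minimal sketch using the standard `Thread.getAllStackTraces()` call; the class and method names (`ThreadPrefixCounter`, `countLiveThreads`) are hypothetical and not part of HBase.

```java
import java.util.Set;

/** Hypothetical diagnostic helper; not part of HBase. */
final class ThreadPrefixCounter {

    /** Returns the number of live threads whose name starts with the given prefix. */
    static long countLiveThreads(String prefix) {
        // getAllStackTraces() returns a snapshot keyed by every live thread in the JVM.
        Set<Thread> liveThreads = Thread.getAllStackTraces().keySet();
        return liveThreads.stream()
            .filter(t -> t.getName().startsWith(prefix))
            .count();
    }

    public static void main(String[] args) {
        // The prefix matches the thread name seen in the stack trace above.
        System.out.println("zk-event-processor threads: "
            + countLiveThreads("zk-event-processor"));
    }
}
```

Logging this count periodically would show whether the number keeps growing while connection attempts are failing.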
{code:java|title=ConnectionManager.java|borderStyle=solid}
HConnectionImplementation(Configuration conf, boolean managed,
    ExecutorService pool, User user, String clusterId) throws IOException {
  ...
  ...
  try {
    this.registry = setupRegistry();
    retrieveClusterId();
    ...
    ...
  } catch (Throwable e) {
    // avoid leaks: registry, rpcClient, ...
    LOG.debug("connection construction failed", e);
    close();
    throw e;
  }
{code}
retrieveClusterId internally calls ZKConnectionRegistry#getClusterId:
{code:java|title=ZKConnectionRegistry.java|borderStyle=solid}
private String clusterId = null;

@Override
public String getClusterId() {
  if (this.clusterId != null) return this.clusterId;
  // No synchronized here, worse case we will retrieve it twice, that's
  // not an issue.
  try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
    this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
    if (this.clusterId == null) {
      LOG.info("ClusterId read in ZooKeeper is null");
    }
  } catch (KeeperException | IOException e) {
    // ---> WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL.
    LOG.warn("Can't retrieve clusterId from Zookeeper", e);
  }
  return this.clusterId;
}
{code}
ZKConnectionRegistry#getClusterId threw the following exception. (Our logging
system trims stack traces longer than 5 lines.)
{noformat}
Cause: org.apache.zookeeper.KeeperException$AuthFailedException:
KeeperErrorCode = AuthFailed for /hbase/hbaseid
StackTrace:
org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213)
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285)
org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
{noformat}
We should propagate the KeeperException from ZKConnectionRegistry#getClusterId
all the way back to the HConnectionImplementation constructor, so that its catch
block closes all the watcher threads and rethrows the exception to the caller.
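The difference between swallowing and propagating the failure can be illustrated with a small, self-contained sketch. It is not HBase code: `SketchConnection`, `ClusterIdSource`, `swallowing` and `OPEN_POOLS` are hypothetical stand-ins, a single-thread executor stands in for the watcher's event thread, and a plain `IOException` stands in for the ZooKeeper auth failure.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a connection that owns an event thread pool. */
final class SketchConnection implements AutoCloseable {

    /** Hypothetical stand-in for the registry's cluster ID lookup. */
    interface ClusterIdSource {
        String getClusterId() throws IOException;
    }

    /** Counts pools that were created but not yet shut down. */
    static final AtomicInteger OPEN_POOLS = new AtomicInteger();

    private final ExecutorService eventPool = Executors.newSingleThreadExecutor();
    private final AtomicBoolean closed = new AtomicBoolean();
    private String clusterId;

    SketchConnection(ClusterIdSource registry) throws IOException {
        OPEN_POOLS.incrementAndGet();
        try {
            // Submit a no-op task so the worker thread actually starts.
            eventPool.submit(() -> { });
            this.clusterId = registry.getClusterId();
        } catch (Throwable e) {
            // Same shape as the constructor above: release resources, then rethrow.
            close();
            throw e;
        }
    }

    /** Wraps a source so failures are swallowed and null is returned instead. */
    static ClusterIdSource swallowing(ClusterIdSource delegate) {
        return () -> {
            try {
                return delegate.getClusterId();
            } catch (IOException e) {
                return null; // the constructor never learns about the failure
            }
        };
    }

    String getClusterId() {
        return clusterId;
    }

    @Override
    public void close() {
        // Idempotent: only the first call shuts the pool down.
        if (closed.compareAndSet(false, true)) {
            eventPool.shutdownNow();
            OPEN_POOLS.decrementAndGet();
        }
    }

    public static void main(String[] args) throws IOException {
        ClusterIdSource failing = () -> {
            throw new IOException("AuthFailed (simulated)");
        };

        // Propagating: the constructor's catch block runs and closes the pool.
        try {
            new SketchConnection(failing);
        } catch (IOException e) {
            System.out.println("propagated; open pools = " + OPEN_POOLS.get());
        }

        // Swallowing: construction "succeeds" and the pool thread stays alive
        // until someone remembers to call close().
        SketchConnection conn = new SketchConnection(swallowing(failing));
        System.out.println("swallowed; open pools = " + OPEN_POOLS.get());
        conn.close();
    }
}
```

In the swallowing case the cleanup path in the constructor is never reached, so the pool outlives the failed lookup unless the caller closes the connection explicitly.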