[
https://issues.apache.org/jira/browse/HBASE-27957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani updated HBASE-27957:
---------------------------------
Component/s: Client
> HConnection (and ZooKeeperWatcher threads) leak in case of AUTH_FAILED
> exception.
> --------------------------------------------------------------------------------
>
> Key: HBASE-27957
> URL: https://issues.apache.org/jira/browse/HBASE-27957
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 1.7.2, 2.4.17
> Reporter: Rushabh Shah
> Priority: Critical
>
> Observed this in a production environment running some version of the 1.7
> release. The application didn't have the right keytab set up for
> authentication. It was trying to create an HConnection, and the ZooKeeper
> server threw an AUTH_FAILED exception.
> After a few hours in this state, the application had thousands of
> zk-event-processor threads with the stack trace below.
> {noformat}
> "zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms elapsed=41794.58s tid=0x00007fd7805066d0 nid=0x1245 waiting on condition [0x00007fd75df01000]
>    java.lang.Thread.State: WAITING (parking)
>     at jdk.internal.misc.Unsafe.park(java.base/Native Method)
>     - parking to wait for <0x00007fd9874a85e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:194)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base/AbstractQueuedSynchronizer.java:2081)
>     at java.util.concurrent.LinkedBlockingQueue.take(java.base/LinkedBlockingQueue.java:433)
>     at java.util.concurrent.ThreadPoolExecutor.getTask(java.base/ThreadPoolExecutor.java:1054)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1114)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
> {noformat}
> {code:java|title=ConnectionManager.java|borderStyle=solid}
> HConnectionImplementation(Configuration conf, boolean managed,
>     ExecutorService pool, User user, String clusterId) throws IOException {
>   ...
>   ...
>   try {
>     this.registry = setupRegistry();
>     retrieveClusterId();
>     ...
>     ...
>   } catch (Throwable e) {
>     // avoid leaks: registry, rpcClient, ...
>     LOG.debug("connection construction failed", e);
>     close();
>     throw e;
>   }
> }
> {code}
> retrieveClusterId internally calls ZKConnectionRegistry#getClusterId:
> {code:java|title=ZKConnectionRegistry.java|borderStyle=solid}
> private String clusterId = null;
>
> @Override
> public String getClusterId() {
>   if (this.clusterId != null) return this.clusterId;
>   // No synchronized here, worse case we will retrieve it twice, that's
>   // not an issue.
>   try (ZooKeeperKeepAliveConnection zkw =
>       hci.getKeepAliveZooKeeperWatcher()) {
>     this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
>     if (this.clusterId == null) {
>       LOG.info("ClusterId read in ZooKeeper is null");
>     }
>   } catch (KeeperException | IOException e) {
>     // ---> WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL.
>     LOG.warn("Can't retrieve clusterId from Zookeeper", e);
>   }
>   return this.clusterId;
> }
> {code}
> ZKConnectionRegistry#getClusterId hit the following exception. (Our logging
> system trims stack traces longer than 5 lines.)
> {noformat}
> Cause: org.apache.zookeeper.KeeperException$AuthFailedException:
> KeeperErrorCode = AuthFailed for /hbase/hbaseid
> StackTrace:
> org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213)
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285)
> org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
> {noformat}
> We should propagate the KeeperException from
> ZKConnectionRegistry#getClusterId all the way up to the
> HConnectionImplementation constructor, so that its catch block closes all
> the watcher threads and rethrows the exception to the caller.
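
The proposal in the description can be illustrated in isolation. The following is only a minimal sketch and not HBase code: SketchWatcher, SketchConnection, LeakSketchDemo and the simulated IOException are hypothetical stand-ins for the watcher, the connection constructor and the AUTH_FAILED failure. It assumes the cluster id lookup propagates its exception, so that the constructor's catch block can release the watcher thread before failing.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a watcher that owns an event-processing thread. */
class SketchWatcher implements AutoCloseable {
  static final AtomicInteger OPEN = new AtomicInteger(); // watchers not yet closed
  private final ExecutorService eventPool = Executors.newSingleThreadExecutor();

  SketchWatcher() {
    eventPool.submit(() -> { }); // force the worker thread to start
    OPEN.incrementAndGet();
  }

  /** Simulates the cluster id lookup; fails the way an AUTH_FAILED read would. */
  String readClusterId(boolean authFails) throws IOException {
    if (authFails) {
      throw new IOException("AuthFailed for /hbase/hbaseid (simulated)");
    }
    return "sketch-cluster-id";
  }

  @Override
  public void close() {
    eventPool.shutdownNow(); // stop the worker so it does not stay parked forever
    OPEN.decrementAndGet();
  }
}

/** Hypothetical stand-in for the connection whose constructor must not leak. */
class SketchConnection implements AutoCloseable {
  private SketchWatcher watcher;
  private String clusterId;

  SketchConnection(boolean authFails) throws IOException {
    try {
      this.watcher = new SketchWatcher();
      // The lookup propagates its exception instead of logging and returning null.
      this.clusterId = watcher.readClusterId(authFails);
    } catch (Throwable e) {
      close(); // release the watcher before the constructor fails
      throw e;
    }
  }

  String getClusterId() {
    return clusterId;
  }

  @Override
  public void close() {
    if (watcher != null) {
      watcher.close();
      watcher = null;
    }
  }
}

class LeakSketchDemo {
  /** Attempts one connection and reports how many watchers remain open afterwards. */
  static int openWatchersAfterAttempt(boolean authFails) {
    try (SketchConnection conn = new SketchConnection(authFails)) {
      conn.getClusterId();
    } catch (IOException e) {
      // Construction failed; the constructor has already cleaned up.
    }
    return SketchWatcher.OPEN.get();
  }

  public static void main(String[] args) {
    System.out.println("open watchers after failure: " + openWatchersAfterAttempt(true));
    System.out.println("open watchers after success: " + openWatchersAfterAttempt(false));
  }
}
```

In this sketch the count of open watchers is zero after both the failing and the succeeding attempt, because the failure reaches the constructor's catch block. If the lookup instead swallowed the exception and returned null, the constructor would complete normally and cleanup would depend on the caller closing the connection.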