[
https://issues.apache.org/jira/browse/HBASE-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Hofhansl resolved HBASE-7259.
----------------------------------
Resolution: Fixed
> Deadlock in HBaseClient when KeeperException occurred
> -----------------------------------------------------
>
> Key: HBASE-7259
> URL: https://issues.apache.org/jira/browse/HBASE-7259
> Project: HBase
> Issue Type: Bug
> Components: Zookeeper
> Affects Versions: 0.94.0, 0.94.1, 0.94.2
> Reporter: liwei
> Priority: Critical
> Fix For: 0.94.4
>
> Attachments: 7259-0.94-branch.txt, HBASE-7259-0.94.2.txt
>
>
> After the HBase client had been running for a while, all get operations
> became very slow.
> The client logs showed the following:
> 1. Unable to get data of znode /hbase/root-region-server
> {code}
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:485)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
> at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
> at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
> at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
> at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
> at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
> at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
> at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
> at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
> at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
> at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
> {code}
> 2. catalina.out reported one Java-level deadlock:
> {code}
> =============================
> "catalina-exec-800":
> waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
> which is held by "catalina-exec-710"
> "catalina-exec-710":
> waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
> which is held by "catalina-exec-29-EventThread"
> "catalina-exec-29-EventThread":
> waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
> which is held by "catalina-exec-710"
> Java stack information for the threads listed above:
> ===================================================
> "catalina-exec-800":
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
> - waiting to lock <0x0000000731902200> (a java.lang.Object)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
> at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
> at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
> at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
> "catalina-exec-710":
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
> - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
> - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
> at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
> at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
> at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
> at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
> at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
> at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
> - locked <0x0000000731902200> (a java.lang.Object)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
> at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
> at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
> at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
> at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
> "catalina-exec-29-EventThread":
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
> - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
> - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
> Found 1 deadlock.
> {code}
> From the source code, the cause appears to be the following.
> ZooKeeperNodeTracker.getData hits a KeeperException and, while still holding
> the ZooKeeperNodeTracker lock, calls abort(), which tries to run
> resetZooKeeperTrackers. At the same time, ClientCnxn.EventThread is also
> running resetZooKeeperTrackers: it holds the HConnectionImplementation lock
> and waits for the ZooKeeperNodeTracker lock in stop(). The two threads
> acquire the same two locks in opposite order, so they deadlock.
> To avoid the problem, we can add a condition in abortable.abort() that
> checks whether a reset is already in progress.
> See the patch.
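
For illustration only, the idea of guarding abort() with a "reset in progress" check can be sketched as below. This is not the attached patch and not HBase code: the classes `ResetGuardSketch`, `Tracker`, and `Connection`, and the fields `resetting` and `resets`, are hypothetical stand-ins that only mirror the lock layout visible in the thread dump (a synchronized tracker method calling abort(), and a synchronized reset calling the tracker's stop()).

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names are assumptions, not HBase classes.
class ResetGuardSketch {

    /** Stand-in for a ZooKeeper node tracker with synchronized methods. */
    static class Tracker {
        private final Connection conn;

        Tracker(Connection conn) {
            this.conn = conn;
        }

        // Holds the tracker monitor while calling abort(), as in the trace.
        synchronized void getData(boolean fail) {
            if (fail) {
                conn.abort("getData failed");
            }
        }

        // Needs the tracker monitor; called from the reset path.
        synchronized void stop() {
        }
    }

    /** Stand-in for the connection that owns and resets the tracker. */
    static class Connection {
        private final AtomicBoolean resetting = new AtomicBoolean(false);
        final AtomicInteger resets = new AtomicInteger();
        volatile Tracker tracker = new Tracker(this);

        void abort(String why) {
            // If another thread is already resetting, return at once instead
            // of blocking on the connection monitor while holding a tracker
            // monitor. This removes the opposite-order lock acquisition.
            if (!resetting.compareAndSet(false, true)) {
                return;
            }
            try {
                resetTrackers();
            } finally {
                resetting.set(false);
            }
        }

        // Takes the connection monitor, then the tracker monitor via stop().
        private synchronized void resetTrackers() {
            tracker.stop();
            tracker = new Tracker(this);
            resets.incrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Connection conn = new Connection();
        // One thread fails inside getData(); the other plays the event thread.
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                conn.tracker.getData(true);
            }
        }, "reader");
        Thread events = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                conn.abort("session event");
            }
        }, "event-thread");
        reader.start();
        events.start();
        reader.join(10_000);
        events.join(10_000);
        boolean finished = !reader.isAlive() && !events.isAlive();
        System.out.println("finished=" + finished + " resets=" + conn.resets.get());
    }
}
```

In this sketch a thread that loses the `compareAndSet` race skips the reset entirely, so it never waits for the connection monitor while holding a tracker monitor. Whether skipping is acceptable in the real client (as opposed to retrying later) is a design question the sketch does not answer.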