liwei created HBASE-7259:
----------------------------

             Summary: Deadlock in HBaseClient when KeeperException occurred
                 Key: HBASE-7259
                 URL: https://issues.apache.org/jira/browse/HBASE-7259
             Project: HBase
          Issue Type: Bug
          Components: Zookeeper
    Affects Versions: 0.94.2, 0.94.1, 0.94.0
            Reporter: liwei
            Priority: Critical


After HBaseClient had been running for a period of time, all get operations 
became very slow.

From the client logs I could see the following:

1. Unable to get data of znode /hbase/root-region-server
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
        at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
        at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
        at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
        at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
        at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
        at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)

2. jstack traces found one Java-level deadlock:

=============================

"catalina-exec-800":
  waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
  which is held by "catalina-exec-710"
"catalina-exec-710":
  waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
  which is held by "catalina-exec-29-EventThread"
"catalina-exec-29-EventThread":
  waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
  which is held by "catalina-exec-710"
Java stack information for the threads listed above:

===================================================

"catalina-exec-800":
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
        - waiting to lock <0x0000000731902200> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
        at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
"catalina-exec-710":
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
        - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
        - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
        at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
        at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
        at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
        at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
        at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
        - locked <0x0000000731902200> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
        at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
"catalina-exec-29-EventThread":
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
        - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
        - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Found 1 deadlock.
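
The jstack summary can be read as a wait-for graph: each blocked thread points 
at the thread that holds the monitor it wants. The sketch below is only an 
illustration of that reading. The class and method names (WaitForGraphSketch, 
waitsFor, threadsOnCycle) are hypothetical and are not part of HBase or jstack; 
only the thread names are taken from the dump.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not HBase code: models the jstack summary as a wait-for graph.
class WaitForGraphSketch {

    // Edges copied from the dump: blocked thread -> thread holding the wanted monitor.
    static Map<String, String> waitsFor() {
        Map<String, String> edges = new LinkedHashMap<>();
        edges.put("catalina-exec-800", "catalina-exec-710");
        edges.put("catalina-exec-710", "catalina-exec-29-EventThread");
        edges.put("catalina-exec-29-EventThread", "catalina-exec-710");
        return edges;
    }

    // A thread is on a cycle if following its wait-for edges leads back to itself.
    static Set<String> threadsOnCycle(Map<String, String> edges) {
        Set<String> onCycle = new LinkedHashSet<>();
        for (String start : edges.keySet()) {
            String current = edges.get(start);
            // edges.size() hops are enough to either return to start or rule it out.
            for (int hops = 0; hops < edges.size() && current != null; hops++) {
                if (current.equals(start)) {
                    onCycle.add(start);
                    break;
                }
                current = edges.get(current);
            }
        }
        return onCycle;
    }

    public static void main(String[] args) {
        System.out.println("Threads on the cycle: " + threadsOnCycle(waitsFor()));
    }
}
```

Read this way, "catalina-exec-710" and "catalina-exec-29-EventThread" form the 
cycle, while "catalina-exec-800" is not part of it and is only queued behind 
the monitor that "catalina-exec-710" holds.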

From the source code, the cause appears to be that a KeeperException occurred 
inside ZooKeeperNodeTracker.getData, which then called abort and tried to run 
resetZooKeeperTrackers. At the same time, ClientCnxn.EventThread was also 
running resetZooKeeperTrackers. getData already held the lock on the 
ZooKeeperNodeTracker and was waiting for the lock on the connection, while the 
EventThread held the lock on the connection and was waiting for the lock on the 
ZooKeeperNodeTracker. The two threads acquire the locks in opposite order, so a 
deadlock occurs.

To avoid the problem, we can reduce the scope of the lock in getData. 
See the patch for 0.94.0.
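
As a rough illustration of what reducing the lock scope could look like (this 
is not the patch itself), the sketch below holds the tracker lock only while 
reading state and calls abort after releasing it. Connection, Tracker, 
resetTrackers and the simulateFailure flag are hypothetical stand-ins for 
HConnectionImplementation, ZooKeeperNodeTracker, resetZooKeeperTrackers and a 
KeeperException.

```java
// Hypothetical stand-ins, not HBase code: shows abort() being called outside the tracker lock.
class NarrowedLockSketch {

    // Stand-in for HConnectionImplementation.
    static class Connection {
        private final Tracker tracker = new Tracker(this);

        // Same lock order as the EventThread in the trace: connection first, then tracker.
        synchronized void resetTrackers() {
            tracker.stop();
        }

        void abort(String why) {
            resetTrackers();
        }
    }

    // Stand-in for ZooKeeperNodeTracker.
    static class Tracker {
        private final Connection connection;
        private byte[] data = new byte[0];
        private boolean stopped;

        Tracker(Connection connection) {
            this.connection = connection;
        }

        synchronized void stop() {
            stopped = true;
        }

        byte[] getData(boolean simulateFailure) {
            byte[] snapshot = null;
            RuntimeException failure = null;
            // Hold the tracker lock only while touching tracker state.
            synchronized (this) {
                try {
                    if (simulateFailure) {
                        throw new IllegalStateException("simulated ZooKeeper error");
                    }
                    snapshot = data;
                } catch (IllegalStateException e) {
                    failure = e;
                }
            }
            // abort() runs without the tracker lock, so it takes the locks in the
            // same order as the event thread: connection, then tracker.
            if (failure != null) {
                connection.abort(failure.getMessage());
            }
            return snapshot;
        }
    }

    // Runs a failing reader and an aborting "event" thread concurrently and
    // reports whether both finished within the timeout.
    static boolean bothThreadsFinish() {
        Connection connection = new Connection();
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                connection.tracker.getData(true);
            }
        });
        Thread events = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                connection.abort("simulated session event");
            }
        });
        reader.setDaemon(true);
        events.setDaemon(true);
        reader.start();
        events.start();
        try {
            reader.join(5_000);
            events.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !reader.isAlive() && !events.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + bothThreadsFinish());
    }
}
```

If getData were declared synchronized as a whole, the reader would hold the 
tracker lock while waiting for the connection lock, which is the inversion 
shown in the trace.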

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira