Andrey Stepachev created HBASE-11460:
----------------------------------------

             Summary: Deadlock in HMaster on masterAndZKLock in HConnectionManager
                 Key: HBASE-11460
                 URL: https://issues.apache.org/jira/browse/HBASE-11460
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.96.0
            Reporter: Andrey Stepachev
            Priority: Critical
         Attachments: threads.tdump

On one of our clusters we hit a deadlock in HMaster.
In a nutshell, the deadlock is caused by using a single HConnectionManager both for 
client-like calls and for calls made from HMaster RPC handlers.

HBaseAdmin uses HConnectionManager, which takes the lock masterAndZKLock.
On the other side sits TableNamespaceManager (TNM). This class also uses 
HConnectionManager (in my case, to get the list of available namespaces).
The problem is that HMaster uses TNM to serve RPC requests.
If we look at TNM more closely, we can see that the class is fully 
synchronized.

That gives us a problem.

The web interface issues a request through HConnectionManager and acquires 
masterAndZKLock.
The call is blocking, so RpcClient waits for the reply while still holding 
the lock.
This is how it looks in the thread dump:
{code}
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1435)
        - locked <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
        at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1467)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2093)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1819)
        - locked <0x00000000d15dc668> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.HBaseAdmin$MasterCallable.prepare(HBaseAdmin.java:3187)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
        - locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:96)
        - locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
        at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3214)
        at org.apache.hadoop.hbase.client.HBaseAdmin.listTableDescriptorsByNamespace(HBaseAdmin.java:2265)
{code}

Some other client calls an HMaster RPC, which calls TableNamespaceManager 
methods, which in turn block on the global HConnectionManager lock 
masterAndZKLock.
This is how it looks:

{code}
  java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1699)
        - waiting to lock <0x00000000d15dc668> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:100)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableDisabled(HConnectionManager.java:874)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1027)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
        at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
        - locked <0x00000000cd0ef108> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
        at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
        - locked <0x00000000d1b49fd8> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
        at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
        - locked <0x00000000cd0ef248> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:756)
        at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:134)
        at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:118)
        - locked <0x00000000d189da20> (a org.apache.hadoop.hbase.master.TableNamespaceManager)
        at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3113)
        at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3133)
        at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3034)
        at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38261)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
        at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
{code}

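Finally, the original handler, which should serve the request from the web GUI, 
can block on TNM methods (the TNM monitor is held by the handler above, which 
is itself waiting for masterAndZKLock), effectively forming a deadlock.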
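
The shape of this cycle can be reproduced outside HBase. What follows is only a minimal sketch, not HBase code: NamespaceManager is a hypothetical stand-in for the fully synchronized TNM, the static masterAndZKLock object stands in for the lock inside HConnectionManager, and a two-thread executor plays the role of the HMaster RPC handlers. The simulated web call waits with a timeout only so that the example terminates and can release the lock afterwards.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class LockCycleSketch {
    // Hypothetical stand-in for the shared lock inside HConnectionManager.
    static final Object masterAndZKLock = new Object();

    // Hypothetical stand-in for the fully synchronized TableNamespaceManager.
    static final class NamespaceManager {
        synchronized String get(CountDownLatch entered) {
            entered.countDown();              // the TNM monitor is now held
            synchronized (masterAndZKLock) {  // needs the shared connection lock
                return "default";
            }
        }
    }

    // Returns true if the simulated web call never receives its reply in time.
    static boolean webCallTimesOut(long timeoutMillis) throws Exception {
        NamespaceManager tnm = new NamespaceManager();
        ExecutorService handlers = Executors.newFixedThreadPool(2);
        CountDownLatch handlerInsideTnm = new CountDownLatch(1);
        try {
            // The "web interface" thread takes the shared lock first.
            synchronized (masterAndZKLock) {
                // Another handler enters TNM, then blocks on masterAndZKLock.
                handlers.submit(() -> tnm.get(handlerInsideTnm));
                handlerInsideTnm.await();
                // The handler serving the web request blocks on the TNM monitor.
                Future<String> reply =
                        handlers.submit(() -> tnm.get(new CountDownLatch(1)));
                try {
                    reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true; // a real blocking call would keep waiting here
                }
            }
        } finally {
            // The lock is released by now, so both handlers can finish.
            handlers.shutdown();
            handlers.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("web call stuck: " + webCallTimesOut(200));
    }
}
```

In this sketch, the cycle disappears if the lock holder does not wait for a reply while holding masterAndZKLock, or if the handlers do not share that lock with the caller. Whether either option suits HBase itself is not claimed here.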




