[ 
https://issues.apache.org/jira/browse/HBASE-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Stepachev updated HBASE-11460:
-------------------------------------

    Attachment: threads.tdump

Thread dump attached.

> Deadlock in HMaster on masterAndZKLock in HConnectionManager
> ------------------------------------------------------------
>
>                 Key: HBASE-11460
>                 URL: https://issues.apache.org/jira/browse/HBASE-11460
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: Andrey Stepachev
>            Priority: Critical
>         Attachments: threads.tdump
>
>
> On one of our clusters we got a deadlock in HMaster.
> In a nutshell, the deadlock is caused by using one HConnectionManager both for
> client-like calls and for calls made from HMaster RPC handlers.
> HBaseAdmin uses HConnectionManager, which takes the lock masterAndZKLock.
> On the other side sits TableNamespaceManager (TNM). This class
> uses HConnectionManager too (in my case, to get the list of available
> namespaces).
> The problem is that the HMaster class uses TNM to serve RPC requests.
> If we look at TNM more closely, we can see that this class is fully
> synchronized.
> That gives us a problem.
> The web interface issues a request via HConnectionManager and locks masterAndZKLock.
> The connection is blocking, so RpcClient waits for the reply (while
> holding the lock).
> This is how it looks in the thread dump:
> {code}
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
>       at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1435)
>       - locked <0x00000000c8905430> (a org.apache.hadoop.hbase.ipc.RpcClient$Call)
>       at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1653)
>       at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1711)
>       at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1467)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2093)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1819)
>       - locked <0x00000000d15dc668> (a java.lang.Object)
>       at org.apache.hadoop.hbase.client.HBaseAdmin$MasterCallable.prepare(HBaseAdmin.java:3187)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
>       - locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:96)
>       - locked <0x00000000cd0c1238> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
>       at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3214)
>       at org.apache.hadoop.hbase.client.HBaseAdmin.listTableDescriptorsByNamespace(HBaseAdmin.java:2265)
> {code}
> Some other client then calls any HMaster RPC that uses TableNamespaceManager
> methods, which in turn block on the HConnectionManager global lock
> masterAndZKLock.
> This is how it looks:
> {code}
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveZooKeeperWatcher(HConnectionManager.java:1699)
>       - waiting to lock <0x00000000d15dc668> (a java.lang.Object)
>       at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:100)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isTableDisabled(HConnectionManager.java:874)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:1027)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
>       at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
>       - locked <0x00000000cd0ef108> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
>       at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
>       at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
>       - locked <0x00000000d1b49fd8> (a java.lang.Object)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
>       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:852)
>       at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:72)
>       at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:119)
>       - locked <0x00000000cd0ef248> (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
>       at org.apache.hadoop.hbase.client.HTable.get(HTable.java:756)
>       at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:134)
>       at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:118)
>       - locked <0x00000000d189da20> (a org.apache.hadoop.hbase.master.TableNamespaceManager)
>       at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3113)
>       at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3133)
>       at org.apache.hadoop.hbase.master.HMaster.listTableDescriptorsByNamespace(HMaster.java:3034)
>       at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38261)
>       at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>       at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
> {code}
> Finally, the original handler, which should serve the request from the web UI,
> can itself be blocked on TNM methods, effectively forming a deadlock.
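
The lock interaction described in the report can be modelled in isolation. The following is a minimal sketch, not HBase code: the class, the field names (connectionLock, namespaceManagerMonitor) and both methods are hypothetical stand-ins for masterAndZKLock, the synchronized TableNamespaceManager, an RPC handler and the web UI caller. It assumes a single handler thread, and it uses a timed tryLock so that the sketch terminates at the point where the real handler would stay BLOCKED.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class SharedConnectionLockSketch {
    // Hypothetical stand-in for the connection-wide masterAndZKLock.
    static final ReentrantLock connectionLock = new ReentrantLock();
    // Hypothetical stand-in for the monitor of the synchronized TableNamespaceManager.
    static final Object namespaceManagerMonitor = new Object();

    // Models an RPC handler: it enters the synchronized manager first and
    // then needs the shared connection lock. Returns false if the lock could
    // not be taken within waitMillis (the real handler would block instead).
    static boolean handleRequest(long waitMillis) throws InterruptedException {
        synchronized (namespaceManagerMonitor) {
            if (!connectionLock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
                return false;
            }
            try {
                return true; // the lookup through the shared connection would go here
            } finally {
                connectionLock.unlock();
            }
        }
    }

    // Models the client-like caller inside the same process: it optionally
    // holds the connection lock while waiting for the handler's reply.
    static boolean callHandler(boolean holdLockWhileWaiting) throws Exception {
        ExecutorService handlers = Executors.newSingleThreadExecutor();
        try {
            if (holdLockWhileWaiting) {
                connectionLock.lock();
            }
            try {
                Future<Boolean> reply = handlers.submit(() -> handleRequest(200));
                return reply.get(5, TimeUnit.SECONDS);
            } finally {
                if (holdLockWhileWaiting) {
                    connectionLock.unlock();
                }
            }
        } finally {
            handlers.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("lock held across the call, handler proceeds: " + callHandler(true));
        System.out.println("lock not held across the call, handler proceeds: " + callHandler(false));
    }
}
```

In this model the handler cannot proceed while the caller keeps the connection lock across its blocking call, and it can proceed once the lock is not held during the wait. Whether the actual fix should avoid holding the lock across the RPC or give the master-side code its own connection is not settled by this sketch.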



--
This message was sent by Atlassian JIRA
(v6.2#6252)
