[ https://issues.apache.org/jira/browse/HBASE-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818561#comment-13818561 ]
Lars Hofhansl commented on HBASE-9939:
--------------------------------------

That said, one would expect the callable itself to time out and throw an exception.

Since you unplugged the network, you are probably mostly seeing the attempts to reestablish the ZK session.
I've seen this with our clients: if the entire cluster is down, it takes (with the 0.94 defaults) up to 20 minutes of various retries before the client eventually times out.
Some of this was improved by avoiding nested retry loops (see HBASE-6326), but it still takes a long time with the defaults.

In our systems we use different ZK timeouts and retry counts in the server (where this is used for server-to-server communication) and in the client (where we prefer fast timeouts so that we do not tie up our AppServer threads).

This looks a bit different though:
{code}
"hbase-tablepool-7-thread-4" id=43 idx=0xc0 tid=22572 prio=5 alive, waiting, native_blocked, daemon
    -- Waiting for notification on: org/apache/hadoop/hbase/ipc/HBaseClient$Call@0x00000000058F1F38[fat lock]
    at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
    at java/lang/Object.wait(J)V(Native Method)
    at java/lang/Object.wait(Object.java:485)
    at org/apache/hadoop/hbase/ipc/HBaseClient.call(HBaseClient.java:981)
    ^-- Lock released while waiting: org/apache/hadoop/hbase/ipc/HBaseClient$Call@0x00000000058F1F38[fat lock]
    at org/apache/hadoop/hbase/ipc/SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:104)
    at $Proxy7.multi(Lorg/apache/hadoop/hbase/client/MultiAction;)Lorg/apache/hadoop/hbase/client/MultiResponse;(Unknown Source)
    at org/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1398)
    at org/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1396)
    at org/apache/hadoop/hbase/client/ServerCallable.withoutRetries(ServerCallable.java:210)
    at org/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1405)
{code}

So we need to look into this.

> All HBase client threads are locked out on network failure
> ----------------------------------------------------------
>
>                 Key: HBASE-9939
>                 URL: https://issues.apache.org/jira/browse/HBASE-9939
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.94.6
>            Reporter: Rohit Joshi
>             Fix For: 0.94.14
>
>
> Under load, when I disabled the network interface, all HBase threads were locked
> out. I was expecting these threads to be released based on
> client.operation.timeout and rpc.timeout.
> Here is a link to the thread dump:
> https://www.dropbox.com/s/y1ng3yoywq09x2u/HBaseClient_Threaddump.txt

--
This message was sent by Atlassian JIRA (v6.1#6144)