[ 
https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199654#comment-14199654
 ] 

Qiang Tian commented on HBASE-12336:
------------------------------------

Hi [~stack], 
As I understand it, ZOOKEEPER-2012 could apply to this issue as well.
The root of the problem is that the ZooKeeper client uses two queues for 
request handling. If an exception occurs in the send thread (in this case, 
perhaps because the cluster was restarted?) while a packet is on neither of 
the two queues, the packet is simply ignored, so the main thread never gets a 
response and hangs there. But we need more data to prove this (so far the 
occurrence is rare).
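
To make the suspected race concrete, here is a minimal sketch of it. This is 
not ZooKeeper's actual code: the class LostPacketSketch, its Packet type, the 
two queue fields and every method name are hypothetical, and the bounded wait 
is only there so the demo terminates instead of hanging.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class LostPacketSketch {
    /** Hypothetical stand-in for a client request; not ZooKeeper's Packet. */
    static final class Packet {
        boolean finished;
    }

    private final Queue<Packet> outgoing = new ArrayDeque<>(); // not yet sent
    private final Queue<Packet> pending = new ArrayDeque<>();  // sent, awaiting reply

    synchronized void enqueue(Packet p) { outgoing.add(p); }

    /** Send thread takes the packet; it is now on neither queue. */
    synchronized Packet takeForSending() { return outgoing.poll(); }

    /** Cleanup after a connection error: only queued packets are woken up. */
    synchronized void failQueuedPackets() {
        for (Packet p; (p = outgoing.poll()) != null; ) finish(p);
        for (Packet p; (p = pending.poll()) != null; ) finish(p);
    }

    private static void finish(Packet p) {
        synchronized (p) {
            p.finished = true;
            p.notifyAll();
        }
    }

    /** Bounded wait; an unbounded p.wait() here would hang on a lost packet. */
    static boolean awaitFinished(Packet p, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (p) {
            while (!p.finished) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) return false;
                try {
                    p.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    /** Error arrives while the packet is still on the outgoing queue. */
    static boolean queuedPacketIsWoken() {
        LostPacketSketch client = new LostPacketSketch();
        Packet p = new Packet();
        client.enqueue(p);
        client.failQueuedPackets();
        return awaitFinished(p, 100);
    }

    /** Error arrives after the packet left one queue, before it joined the other. */
    static boolean inFlightPacketIsWoken() {
        LostPacketSketch client = new LostPacketSketch();
        Packet p = new Packet();
        client.enqueue(p);
        client.takeForSending();
        client.failQueuedPackets();
        return awaitFinished(p, 100);
    }

    public static void main(String[] args) {
        System.out.println("queued packet woken:    " + queuedPacketIsWoken());
        System.out.println("in-flight packet woken: " + inFlightPacketIsWoken());
    }
}
```

In the second case nobody ever marks the packet finished, which would match a 
caller stuck in an unbounded wait, as in the attached stack.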
thanks.


> RegionServer failed to shutdown for NodeFailoverWorker thread
> -------------------------------------------------------------
>
>                 Key: HBASE-12336
>                 URL: https://issues.apache.org/jira/browse/HBASE-12336
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.11
>            Reporter: Liu Shaohui
>            Assignee: Liu Shaohui
>            Priority: Minor
>             Fix For: 2.0.0, 0.94.26, 0.98.9, 0.99.2
>
>         Attachments: HBASE-12336-trunk-v1.diff, stack
>
>
> After enabling hbase.zookeeper.useMulti in an HBase cluster, we found that a 
> regionserver failed to shut down. All other threads had exited except a 
> NodeFailoverWorker thread.
> {code}
> "ReplicationExecutor-0" prio=10 tid=0x00007f0d40195ad0 nid=0x73a in Object.wait() [0x00007f0dc8fe6000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
>         - locked <0x00000005a16df080> (a org.apache.zookeeper.ClientCnxn$Packet)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518)
>         at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> We are sure that the shutdown method of the executor is called in 
> ReplicationSourceManager#join.
>  
> I am looking for the root cause, and suggestions are welcome. Thanks.
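
For reference, a call to shutdown() on a java.util.concurrent executor does 
not interrupt a task that is already running, so a worker blocked in 
Object.wait() keeps the pool alive. The sketch below shows only that general 
executor behaviour. The class ShutdownSketch and its monitor object are 
hypothetical, and it does not establish whether an interrupt would unblock 
the ZooKeeper call in the trace above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ShutdownSketch {
    /** Returns {terminated after shutdown(), terminated after shutdownNow()}. */
    static boolean[] run() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Object monitor = new Object();
        CountDownLatch started = new CountDownLatch(1);

        // Worker that waits on a monitor nobody will ever notify.
        executor.execute(() -> {
            synchronized (monitor) {
                started.countDown();
                try {
                    while (true) {
                        monitor.wait();
                    }
                } catch (InterruptedException e) {
                    // An interrupt is the only way out of this wait.
                }
            }
        });

        try {
            started.await();

            // shutdown() rejects new tasks but leaves the running task alone.
            executor.shutdown();
            boolean afterShutdown =
                executor.awaitTermination(200, TimeUnit.MILLISECONDS);

            // shutdownNow() interrupts the worker, which ends the wait.
            executor.shutdownNow();
            boolean afterShutdownNow =
                executor.awaitTermination(5, TimeUnit.SECONDS);

            return new boolean[] {afterShutdown, afterShutdownNow};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo thread interrupted", e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("terminated after shutdown():    " + result[0]);
        System.out.println("terminated after shutdownNow(): " + result[1]);
    }
}
```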



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
