[ https://issues.apache.org/jira/browse/HBASE-19794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333295#comment-16333295 ]
stack commented on HBASE-19794:
-------------------------------
While Jira was down I spent some time on this last night. The backup Master
tries to become active during cluster shutdown but only gets this far:
{code:java}
Thread 1542 (M:1;asf903:32967):
  State: TIMED_WAITING
  Blocked count: 178
  Waited count: 389
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
    org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1117)
    org.apache.hadoop.hbase.client.ConnectionImplementation.getTableState(ConnectionImplementation.java:1960)
    org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getTableState(ConnectionUtils.java:131)
    org.apache.hadoop.hbase.client.ConnectionImplementation.isTableDisabled(ConnectionImplementation.java:573)
    org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.isTableDisabled(ConnectionUtils.java:131)
    org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:219)
    org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
    org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:139)
    org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:276)
    org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:101)
    org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
    org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
    org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1059)
    org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:921)
{code}
The backup Master will stay stuck here until all retries have been exhausted.
This is a variant of an issue seen elsewhere: a client hosted in a server keeps
trying to contact a server or region that is never going to show up, usually
because the cluster is going down. We need a means of signaling the client that
it should give up because its host is going away. We probably also need to move
client communication off the main thread so that the main thread remains
available and can react to shutdown.
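To make the first idea concrete, here is a minimal sketch of a retry loop that checks a "host is stopping" flag before every attempt. It is not the HBase client code: the class StoppableRetry, the exception HostStoppingException, and the hostStopping parameter are all hypothetical names made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

/** Hypothetical helper, NOT an HBase class: retries a call unless the host is stopping. */
public final class StoppableRetry {

  /** Hypothetical exception: the host asked us to stop before the call succeeded. */
  public static final class HostStoppingException extends Exception {
    public HostStoppingException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  private StoppableRetry() {
  }

  /**
   * Runs {@code call} up to {@code maxAttempts} times, pausing between attempts.
   * Before every attempt it asks {@code hostStopping} whether the hosting server
   * is going away; if so it gives up at once instead of burning the remaining retries.
   */
  public static <T> T callWithRetries(Callable<T> call, int maxAttempts, long pauseMillis,
      BooleanSupplier hostStopping) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (hostStopping.getAsBoolean()) {
        throw new HostStoppingException(
            "host is stopping; gave up before attempt " + attempt, last);
      }
      try {
        return call.call();
      } catch (Exception e) {
        last = e; // remember the failure and try again
      }
      if (attempt < maxAttempts) {
        Thread.sleep(pauseMillis);
      }
    }
    if (last == null) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    throw last; // retries exhausted
  }
}
```

A server-hosted caller would pass something like its own stopped flag as hostStopping, so a pending get fails fast once shutdown has been requested. The sketch only checks the flag between attempts; it does not interrupt a call that is already blocked, which is part of why moving the call off the main thread would still be needed.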
Concurrent w/ my digging, [~Apache9] was digging too and arrived at the same
place (offline because Jira was down). He came up w/ a better workaround for now
than my cutting down on retries: he suggested MiniHBaseCluster should put down
the backup Masters first, before the active Master. (Thinking on it, it may
not work... damage may already have been done before we get to the shutdown
sequence... the backup Master may have already started in on the shutdown
sequence.)
Let me work up a patch based on Duo's commit:
[https://github.com/Apache9/hbase/commit/97e030584504cc6019ef06462f6d44ca40125c45]
I'll add a timeout, Duo's suggestion, and some other cleanup I came across while
digging last night. Will also file an issue to deal better w/ the root problem
of clients stuck in retry even though the cluster has been asked to go down.
> TestZooKeeper hangs
> -------------------
>
> Key: HBASE-19794
> URL: https://issues.apache.org/jira/browse/HBASE-19794
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0-beta-2
>
> Attachments: org.apache.hadoop.hbase.TestZooKeeper-output.txt
>
>
> Seems like it is TestZKAsyncRegistry that hangs in shutdown.