[
https://issues.apache.org/jira/browse/HBASE-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067372#comment-17067372
]
Michael Stack commented on HBASE-23808:
---------------------------------------
I just saw this bit in TestMasterShutdown:
{code}
// Switching to master registry exacerbated a race in the master bootstrap that can result
// in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
// the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
// made to the master. The suspected reason as to why it was uncommon before HBASE-18095
// is because the connection creation with ZK registry is so slow that by then the server
// manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
// wait() in the test, waiting for the server manager to become available.
final long timeout = TimeUnit.MINUTES.toMillis(10);
assertNotEquals("timeout waiting for server manager to become available.",
  -1, Waiter.waitFor(htu.getConfiguration(), timeout,
    () -> masterThread.getMaster().getServerManager() != null...
{code}
... which probably explains the 'hang' I see.
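For anyone following along, here is a rough standalone sketch of the wait-with-timeout pattern that snippet relies on. It is not the HBase Waiter code; WaitForSketch, waitFor, the 20 ms poll interval, and the AtomicReference standing in for the server manager are all made up for illustration. The only thing borrowed from the test is the convention of treating -1 as "timed out".

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not HBase code: poll a condition until it holds or a timeout expires.
class WaitForSketch {

  /**
   * Polls the condition every intervalMs until it returns true or timeoutMs elapses.
   * Returns the elapsed time in milliseconds, or -1 on timeout or interruption.
   */
  public static long waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    final long start = System.nanoTime();
    final long deadline = start + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (condition.getAsBoolean()) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      }
      if (System.nanoTime() - deadline >= 0) {
        return -1; // condition never became true in time
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return -1;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the server manager reference, which is null until bootstrap finishes.
    AtomicReference<Object> serverManager = new AtomicReference<>();

    // Simulated slow bootstrap that publishes the reference after a delay.
    Thread bootstrap = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      serverManager.set(new Object());
    });
    bootstrap.start();

    // Do not issue the dependent call until the reference is non-null.
    long waited = waitFor(5000, 20, () -> serverManager.get() != null);
    if (waited == -1) {
      throw new AssertionError("timeout waiting for server manager to become available.");
    }
    System.out.println("server manager became available after " + waited + " ms");
    bootstrap.join();
  }
}
```

Note that with a generous timeout, a condition that never becomes true looks like a hang from the outside until the deadline passes.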
In RSProcedureDispatcher#start we were getting NPEs, which correlated with the
test failures. Above, I added a catch that returns a failed start; the NPEs
seemed to happen because the Master had already been stopped. Made a subtask
adding more debug for now while the tests run overnight.
> [Flakey Test] TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer
> --------------------------------------------------------------------------------
>
> Key: HBASE-23808
> URL: https://issues.apache.org/jira/browse/HBASE-23808
> Project: HBase
> Issue Type: Test
> Components: test
> Affects Versions: 2.3.0
> Reporter: Nick Dimiduk
> Assignee: Nick Dimiduk
> Priority: Major
> Fix For: 3.0.0, 2.3.0, 2.2.4
>
> Attachments: TEST-org.apache.hadoop.hbase.master.TestMasterShutdown.xml
>
> Reproduces locally from time to time. Not much to go on here. Looks like the
> test is trying to do some fancy HBase cluster initialization order on top of
> a mini-cluster. Failure seems related to trying to start the HBase master
> before HDFS is fully initialized.