Michael Stack commented on HBASE-23808:

I just saw this bit in TestMasterShutdown:

        // Switching to master registry exacerbated a race in the master bootstrap that can result
        // in a lost shutdown command (HBASE-8422, HBASE-23836). The race is essentially because
        // the server manager in HMaster is not initialized by the time shutdown() RPC (below) is
        // made to the master. The suspected reason as to why it was uncommon before HBASE-18095
        // is because the connection creation with ZK registry is so slow that by then the server
        // manager is usually init'ed in time for the RPC to be made. For now, adding an explicit
        // wait() in the test, waiting for the server manager to become 
        final long timeout = TimeUnit.MINUTES.toMillis(10);
        assertNotEquals("timeout waiting for server manager to become 
          -1, Waiter.waitFor(htu.getConfiguration(), timeout,
            () -> masterThread.getMaster().getServerManager() != null...

... which probably explains the 'hang' I see.
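
For anyone following along, the wait-before-RPC idea in that comment can be sketched in isolation roughly as below. This is only an illustration, not the HBase code: WaitForInitSketch, its waitFor helper, and the AtomicReference standing in for the server manager are all made-up names, and the helper is a plain polling loop rather than the real Waiter utility.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: poll until a lazily initialized component is available
// before issuing a call that depends on it.
class WaitForInitSketch {

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns the elapsed time in milliseconds, or -1 on timeout or interruption.
   */
  static long waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    final long start = System.nanoTime();
    final long deadline = start + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (true) {
      if (condition.getAsBoolean()) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      }
      if (System.nanoTime() - deadline >= 0) {
        return -1; // timed out without the condition becoming true
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return -1;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the server manager: null until the simulated bootstrap finishes.
    AtomicReference<Object> serverManager = new AtomicReference<>();

    Thread bootstrap = new Thread(() -> {
      try {
        Thread.sleep(200); // simulated slow initialization
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      serverManager.set(new Object());
    });
    bootstrap.start();

    // Without this wait, a "shutdown" sent right away could race with bootstrap.
    long waited = waitFor(5_000, 20, () -> serverManager.get() != null);
    if (waited == -1) {
      throw new IllegalStateException("timeout waiting for server manager");
    }
    System.out.println("server manager initialized; safe to send shutdown");
    bootstrap.join();
  }
}
```

Note that a wait like this blocks for the full timeout if the component never initializes, which with a long timeout would look like a hang.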

In RSProcedureDispatcher#start, we were getting NPEs, which correlated with the 
test failures. Above, I added a catch that returns a failed start; the NPEs 
seemed to happen because the Master had already been stopped. I made a subtask 
that adds more debug logging for now while the tests run overnight.
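
A rough sketch of that kind of guard, for illustration only: GuardedStartSketch, Dispatcher, and the ServerManager interface here are hypothetical stand-ins, not the real RSProcedureDispatcher code, and the sketch uses an explicit null check where the actual change catches the exception.

```java
import java.util.function.Supplier;

// Hypothetical sketch: report a failed start instead of throwing an NPE when a
// dependency is missing (for example because the owner has already stopped).
class GuardedStartSketch {

  /** Made-up stand-in for the component the dispatcher depends on. */
  interface ServerManager {
    void registerListener(Runnable listener);
  }

  static class Dispatcher {
    private final Supplier<ServerManager> serverManagerSupplier;

    Dispatcher(Supplier<ServerManager> serverManagerSupplier) {
      this.serverManagerSupplier = serverManagerSupplier;
    }

    /** Returns true if started, false if the dependency was not available. */
    boolean start() {
      ServerManager serverManager = serverManagerSupplier.get();
      if (serverManager == null) {
        // Extra debug output makes the failed start visible in test logs.
        System.err.println("start() failed: server manager is null "
            + "(stopped already, or not yet initialized?)");
        return false;
      }
      serverManager.registerListener(() -> { });
      return true;
    }
  }

  public static void main(String[] args) {
    Dispatcher stopped = new Dispatcher(() -> null);
    System.out.println("start with missing dependency: " + stopped.start());

    Dispatcher running = new Dispatcher(() -> listener -> { });
    System.out.println("start with dependency present: " + running.start());
  }
}
```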

> [Flakey Test] TestMasterShutdown#testMasterShutdownBeforeStartingAnyRegionServer
> --------------------------------------------------------------------------------
>                 Key: HBASE-23808
>                 URL: https://issues.apache.org/jira/browse/HBASE-23808
>             Project: HBase
>          Issue Type: Test
>          Components: test
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.2.4
>         Attachments: TEST-org.apache.hadoop.hbase.master.TestMasterShutdown.xml
> Reproduces locally from time to time. Not much to go on here. Looks like the 
> test is trying to do some fancy HBase cluster initialization order on top of 
> a mini-cluster. Failure seems related to trying to start the HBase master 
> before HDFS is fully initialized.

This message was sent by Atlassian Jira