[
https://issues.apache.org/jira/browse/HBASE-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249057#comment-17249057
]
Michael Stack commented on HBASE-25389:
---------------------------------------
The PR seems to fix the problem for me.
> [Flakey Tests] branch-2 TestMetaShutdownHandler
> -----------------------------------------------
>
> Key: HBASE-25389
> URL: https://issues.apache.org/jira/browse/HBASE-25389
> Project: HBase
> Issue Type: Task
> Components: flakies
> Reporter: Michael Stack
> Priority: Major
>
> I see this fail regularly in local runs. We kill the server hosting meta and
> then check that meta came up in a new location after waiting on recovery. When
> the test fails, the assert on the new location fails because we have not
> waited for the CRASH to happen. Here is an excerpt from the log:
> {code}
> 2020-12-11 13:20:27,298 INFO [Listener at localhost/62149]
> master.TestMetaShutdownHandler(111): Deleted the znode for the RegionServer
> hosting hbase:meta; waiting on SSH
> ...
> 2020-12-11 13:20:27,310 INFO [Listener at localhost/62149]
> master.TestMetaShutdownHandler(122): Past wait on RIT
> ...
> 2020-12-11 13:20:27,351 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor(1048): Stored pid=9,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> stack.XXX.example.com,62201,1607721618377, splitWal=true, meta=true
> {code}
> The first line is where we remove the ephemeral node for the regionserver
> carrying hbase:meta. The second line is supposed to log AFTER SCP is done (this
> old test still calls it SSH). Notice how the third line, which comes after the
> second, is the first mention of SCP being queued.
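
The race described above (the test checks the new location before the crash procedure has even been queued) can be illustrated with a minimal, self-contained sketch. This is not the code from the PR and it uses no HBase APIs: the class name, the `waitFor` helper, and the `crashQueued` flag are hypothetical stand-ins, and the background thread only simulates the tracker queuing the ServerCrashProcedure a little after the znode is deleted.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names come from HBase or from the PR.
class WaitForCrashSketch {

  // Polls the condition until it holds or the timeout elapses.
  // Returns true if the condition became true, false on timeout or interrupt.
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long start = System.nanoTime();
    long timeoutNanos = timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - start >= timeoutNanos) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for "the crash procedure has been queued".
    AtomicBoolean crashQueued = new AtomicBoolean(false);

    // Simulates the tracker noticing the deleted znode slightly later.
    Thread tracker = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      crashQueued.set(true);
    });
    tracker.start();

    // Wait for the crash to be registered BEFORE waiting on recovery
    // and asserting on the new location.
    boolean seen = waitFor(5_000, 10, crashQueued::get);
    tracker.join();
    System.out.println("crash queued before location check: " + seen);
  }
}
```

The point of the sketch is the ordering: the wait on the crash being registered comes first, and only then the wait on recovery and the location assert. Without that first wait, the "Past wait on RIT" step can return immediately because nothing is in transition yet.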
--
This message was sent by Atlassian Jira
(v8.3.4#803005)