[
https://issues.apache.org/jira/browse/HBASE-24117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128045#comment-17128045
]
Duo Zhang commented on HBASE-24117:
-----------------------------------
OK, I found the root cause.
There is still something wrong with the shutdown code.
{noformat}
2020-05-08 15:57:51,275 INFO [M:0;localhost:51555] assignment.AssignmentManager(287): Stopping assignment manager
2020-05-08 15:57:51,277 INFO [M:0;localhost:51555] procedure2.RemoteProcedureDispatcher(113): Stopping procedure remote dispatcher
2020-05-08 15:57:51,277 INFO [PEWorker-4] procedure.ServerCrashProcedure(476): pid=13, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=localhost,51560,1588978659320, splitWal=true, meta=false found a region state=OFFLINE, location=null, table=Backoff, region=3db619df21f59db7441495065e782264 which is no longer on us localhost,51560,1588978659320, give up assigning...
{noformat}
In AssignmentManager.stop, we clean up the regionStates. If the SCP then runs
into SCP.assignRegions, it recreates the RegionStateNode, since the original
one has already been cleared. The recreated node has state=OFFLINE and
location=null (see the log above), so we give up assigning because of this check:
{code}
// This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
// and then try to assign the region to a new RS. And before it has updated the region
// location to the new RS, we may have already called the am.getRegionsOnServer so we will
// consider the region is still on this crashed server. Then before we arrive here, the
// TRSP could have updated the region location, or even finished itself, so the region is
// no longer on this crashed server any more. We should not try to assign it again. Please
// see HBASE-23594 for more details.
// UPDATE: HBCKServerCrashProcedure overrides isMatchingRegionLocation; this check can get
// in the way of our clearing out 'Unknown Servers'.
if (!isMatchingRegionLocation(regionNode)) {
  LOG.info("{} found {} whose regionLocation no longer matches {}, skipping assign...",
    this, regionNode, serverName);
  continue;
}
{code}
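To make the sequence concrete, here is a minimal, self-contained sketch of the race. It is only a model under stated assumptions: RegionNode, RegionStates, and wouldAssign are hypothetical stand-ins, not the real HBase classes, and the location check only mirrors the shape of isMatchingRegionLocation in the snippet above.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race; none of these are the real HBase types.
public class ShutdownRaceSketch {

  /** Stand-in for RegionStateNode: only tracks where the region is believed to live. */
  static final class RegionNode {
    final String regionName;
    volatile String location; // null means "no known location"

    RegionNode(String regionName) {
      this.regionName = regionName;
    }
  }

  /** Stand-in for the regionStates kept by the AssignmentManager. */
  static final class RegionStates {
    private final Map<String, RegionNode> nodes = new ConcurrentHashMap<>();

    /** Returns the existing node, or creates a fresh one with a null location. */
    RegionNode getOrCreate(String regionName) {
      return nodes.computeIfAbsent(regionName, RegionNode::new);
    }

    void clear() {
      nodes.clear();
    }
  }

  /** Mirrors the shape of the check in the quoted snippet (an assumption, not the real method). */
  static boolean isMatchingRegionLocation(RegionNode node, String crashedServer) {
    return Objects.equals(crashedServer, node.location);
  }

  /** Returns true if the crash procedure would assign the region, false if it would skip it. */
  static boolean wouldAssign(boolean stopBeforeAssign) {
    String crashedServer = "localhost,51560,1588978659320";
    String region = "3db619df21f59db7441495065e782264";
    RegionStates states = new RegionStates();
    states.getOrCreate(region).location = crashedServer;

    if (stopBeforeAssign) {
      states.clear(); // what AssignmentManager.stop does in this model
    }
    // The procedure worker is still running and looks the region up again.
    RegionNode node = states.getOrCreate(region);
    return isMatchingRegionLocation(node, crashedServer);
  }

  public static void main(String[] args) {
    System.out.println("normal run assigns: " + wouldAssign(false));
    System.out.println("after stop assigns: " + wouldAssign(true));
  }
}
```

In this model the second call skips the region: the recreated node has a null location, so it no longer matches the crashed server.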
Typically, we should not shut down the AssignmentManager before all the
procedures have quit...
Let me think about how to fix this...
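As a rough illustration of the ordering constraint only (not a proposed patch, and not HBase code), the sketch below drains the workers before the shared state is cleared. A plain ExecutorService stands in for the procedure workers, and OrderedShutdownSketch and shutdownInOrder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a plain ExecutorService stands in for the procedure workers.
public class OrderedShutdownSketch {

  /** Runs one simulated procedure, then shuts down in order and returns the event sequence. */
  static List<String> shutdownInOrder() {
    List<String> events = Collections.synchronizedList(new ArrayList<>());
    ExecutorService workers = Executors.newFixedThreadPool(2);

    workers.execute(() -> {
      try {
        Thread.sleep(50); // simulate a procedure step that is still in flight
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      events.add("procedure finished");
    });

    workers.shutdown(); // accept no new work, but let queued and running tasks complete
    try {
      if (!workers.awaitTermination(5, TimeUnit.SECONDS)) {
        workers.shutdownNow(); // give up waiting and interrupt whatever is left
        workers.awaitTermination(5, TimeUnit.SECONDS); // then wait for those tasks to exit
      }
    } catch (InterruptedException e) {
      workers.shutdownNow();
      Thread.currentThread().interrupt();
    }

    // Only now is it safe to clear the state the workers were reading.
    events.add("region states cleared");
    return events;
  }

  public static void main(String[] args) {
    System.out.println(shutdownInOrder());
  }
}
```

The point of the sketch is just that the "cleared" event can never precede the "finished" event when the workers are awaited first.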
> If move target RS crashes, move fails if concurrent master crash
> ----------------------------------------------------------------
>
> Key: HBASE-24117
> URL: https://issues.apache.org/jira/browse/HBASE-24117
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Reporter: Michael Stack
> Assignee: Michael Stack
> Priority: Major
> Attachments:
> org.apache.hadoop.hbase.master.assignment.TestCloseRegionWhileRSCrash-output.txt
>
>
> I saw this on TestCloseRegionWhileRSCrash. The Region
> 788a516d1f86af98e0a16bcc1afe4fa1 was being moved to RS
> example.com,62652,1586032098445 just after it was killed. The Move Close
> fails because the RS has no node in the Master. The Move then tries to
> 'confirm' the close but it fails because there is no remote RS. We then wait in
> this state until an operator or some other procedure intervenes to 'fix' the
> state. Normally a ServerCrashProcedure would do the job but in this test the
> Master is restarted after the RS is killed, a condition we do not accommodate.
> Let me attach the test log.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)