[jira] [Commented] (HBASE-24117) If move target RS crashes, move fails if concurrent master crash

Duo Zhang (Jira) Mon, 08 Jun 2020 02:05:05 -0700


    [ https://issues.apache.org/jira/browse/HBASE-24117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128045#comment-17128045 ]


Duo Zhang commented on HBASE-24117:
-----------------------------------

OK, I found the root cause.

There is still something wrong with the shutdown code.

{noformat}
2020-05-08 15:57:51,275 INFO  [M:0;localhost:51555] assignment.AssignmentManager(287): Stopping assignment manager
2020-05-08 15:57:51,277 INFO  [M:0;localhost:51555] procedure2.RemoteProcedureDispatcher(113): Stopping procedure remote dispatcher
2020-05-08 15:57:51,277 INFO  [PEWorker-4] procedure.ServerCrashProcedure(476): pid=13, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=localhost,51560,1588978659320, splitWal=true, meta=false found a region state=OFFLINE, location=null, table=Backoff, region=3db619df21f59db7441495065e782264 which is no longer on us localhost,51560,1588978659320, give up assigning...
{noformat}

AssignmentManager.stop cleans up the regionStates. If we then run into 
SCP.assignRegions, it recreates the RegionStateNode because the old one has 
already been cleared. The recreated node has no location (location=null in the 
log above), so we give up assigning at this check:

{code}
        // This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
        // and then try to assign the region to a new RS. And before it has updated the region
        // location to the new RS, we may have already called the am.getRegionsOnServer so we will
        // consider the region is still on this crashed server. Then before we arrive here, the
        // TRSP could have updated the region location, or even finished itself, so the region is
        // no longer on this crashed server any more. We should not try to assign it again. Please
        // see HBASE-23594 for more details.
        // UPDATE: HBCKServerCrashProcedure overrides isMatchingRegionLocation; this check can get
        // in the way of our clearing out 'Unknown Servers'.
        if (!isMatchingRegionLocation(regionNode)) {
          LOG.info("{} found {} whose regionLocation no longer matches {}, skipping assign...",
            this, regionNode, serverName);
          continue;
        }
{code}
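
As a rough illustration of the sequence described above, here is a toy model in plain Java. It is not HBase code: ToyRegionNode, ToyRegionStates, ToyCrashProcedure and ShutdownRaceDemo are hypothetical stand-ins, and the location comparison is only a simplified assumption about what isMatchingRegionLocation checks. It shows how a get-or-create lookup after the state has been cleared yields a node with no location, so the crash handling skips the region.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: hypothetical stand-ins, not HBase's real classes.
final class ToyRegionNode {
    volatile String location; // null means no server is recorded for the region

    ToyRegionNode(String location) {
        this.location = location;
    }
}

final class ToyRegionStates {
    private final Map<String, ToyRegionNode> nodes = new ConcurrentHashMap<>();

    void put(String region, String server) {
        nodes.put(region, new ToyRegionNode(server));
    }

    // Get-or-create lookup: a missing node is recreated with no location.
    ToyRegionNode getOrCreate(String region) {
        return nodes.computeIfAbsent(region, r -> new ToyRegionNode(null));
    }

    // Stands in for the cleanup performed when the assignment manager stops.
    void clear() {
        nodes.clear();
    }
}

final class ToyCrashProcedure {
    // Returns true if the region would be assigned, false if the location check skips it.
    static boolean wouldAssign(ToyRegionStates states, String region, String crashedServer) {
        ToyRegionNode node = states.getOrCreate(region);
        return crashedServer.equals(node.location);
    }
}

final class ShutdownRaceDemo {
    public static void main(String[] args) {
        ToyRegionStates states = new ToyRegionStates();
        states.put("region-1", "rs-1");

        // Normal case: the region is still recorded on the crashed server.
        System.out.println(ToyCrashProcedure.wouldAssign(states, "region-1", "rs-1"));

        // Shutdown clears the state first, then the crash procedure runs.
        states.clear();
        System.out.println(ToyCrashProcedure.wouldAssign(states, "region-1", "rs-1"));
    }
}
```

In this model the second call returns false only because the node was recreated empty, not because another procedure moved the region, which is the case the check was written for.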

Typically, we should not shut down AssignmentManager before all the procedures 
have quit...
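
One possible ordering, sketched with a plain java.util.concurrent executor, is to drain the workers first and only then clear the shared state. This is an assumption for illustration, not necessarily the fix for this issue: OrderedShutdownSketch and all of its methods are hypothetical names, and the real procedure executor is not an ExecutorService.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names come from HBase.
final class OrderedShutdownSketch {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    private final Map<String, String> regionLocations = new ConcurrentHashMap<>();

    void setLocation(String region, String server) {
        regionLocations.put(region, server);
    }

    String locationOf(String region) {
        return regionLocations.get(region);
    }

    int trackedRegions() {
        return regionLocations.size();
    }

    void submit(Runnable procedure) {
        workers.submit(procedure);
    }

    // Returns true if the workers drained in time and the state was cleared.
    boolean stop(long timeoutMillis) {
        workers.shutdown(); // accept no new work; already submitted tasks still run
        try {
            if (!workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return false; // workers are still running, so leave the state alone
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        regionLocations.clear(); // safe now: no worker can observe the cleared state
        return true;
    }
}
```

With this ordering, a task submitted before stop() still sees the region locations it expects, because the map is cleared only after the last worker has exited.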

Let me think how to fix this...

> If move target RS crashes, move fails if concurrent master crash
> ----------------------------------------------------------------
>
>                 Key: HBASE-24117
>                 URL: https://issues.apache.org/jira/browse/HBASE-24117
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.hbase.master.assignment.TestCloseRegionWhileRSCrash-output.txt
>
>
> I saw this on TestCloseRegionWhileRSCrash. The Region
> 788a516d1f86af98e0a16bcc1afe4fa1 was being moved to RS
> example.com,62652,1586032098445 just after that RS was killed. The Move Close
> fails because the RS has no node in the Master. The Move then tries to
> 'confirm' the close but it fails because there is no remote RS. We then wait in
> this state until an operator or some other procedure intervenes to 'fix' the
> state. Normally a ServerCrashProcedure would do the job, but in this test the
> Master is restarted after the RS is killed, a condition we do not accommodate.
> Let me attach the test log.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
