[
https://issues.apache.org/jira/browse/HBASE-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519926#comment-17519926
]
Duo Zhang commented on HBASE-26914:
-----------------------------------
The behavior is expected. The TRSP (TransitRegionStateProcedure) will try to
assign the region again. Later, after we schedule an SCP
(ServerCrashProcedure), the SCP will interrupt this TRSP and let it assign the
region to another region server.
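
To make that sequence concrete, here is a minimal toy model in Java. It is only a sketch of the flow described above, not HBase code: ToyAssign, interruptAndRetarget, run, and the server names are hypothetical stand-ins for the TRSP and SCP roles, and the number of retries before crash handling kicks in is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model only; none of these types are HBase classes. */
public class AssignRetrySketch {

    /** Hypothetical stand-in for one region assignment (the TRSP role). */
    static final class ToyAssign {
        String target;                          // server the region is planned for
        boolean opened = false;                 // true once an attempt succeeds
        final List<String> history = new ArrayList<>();

        ToyAssign(String target) {
            this.target = target;
        }

        /** One attempt: succeeds only if the target is a live server. */
        void attempt(Set<String> liveServers) {
            history.add(target);
            opened = liveServers.contains(target);
        }

        /** Stand-in for crash handling (the SCP role) interrupting the assign. */
        void interruptAndRetarget(String newTarget) {
            target = newTarget;
        }
    }

    /** Runs the flow and returns the servers that were tried, in order. */
    static List<String> run(String staleTarget, Set<String> liveServers,
                            int retriesBeforeCrashHandling) {
        if (liveServers.isEmpty()) {
            throw new IllegalArgumentException("need at least one live server");
        }
        ToyAssign assign = new ToyAssign(staleTarget);
        // The assign keeps retrying its original (possibly dead) target.
        for (int i = 0; i < retriesBeforeCrashHandling && !assign.opened; i++) {
            assign.attempt(liveServers);
        }
        // Once crash handling is scheduled, it redirects the assign elsewhere.
        if (!assign.opened) {
            assign.interruptAndRetarget(liveServers.iterator().next());
            assign.attempt(liveServers);
        }
        return assign.history;
    }

    public static void main(String[] args) {
        // "rs2-old" plays the stopped RS2; "rs3-new" is a server in the new cluster.
        System.out.println(run("rs2-old", Set.of("rs3-new"), 2));
    }
}
```

In this model the assignment only makes progress after the retarget step, which mirrors why the failed open against the old server is not by itself a sign of a bug.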
> Rebuild a cluster from an existing root directory may hang
> ----------------------------------------------------------
>
> Key: HBASE-26914
> URL: https://issues.apache.org/jira/browse/HBASE-26914
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 3.0.0-alpha-2
> Reporter: LiangJun He
> Assignee: LiangJun He
> Priority: Major
> Fix For: 3.0.0-alpha-2
>
>
> After HBASE-26245, we can rebuild a cluster from an existing root directory
> in a cloud environment, but one problem remains: if we stop the
> regionservers of the old cluster first and then stop the masters, the new
> cluster (rebuilt from the old cluster's root directory) may hang.
> This problem is also described in HBASE-26898.
>
> Hang Reason:
> For example, if RS1 is stopped first, the Master expires RS1 and
> generates a ServerCrashProcedure to reassign the region on RS1 (at the same
> time, the Master deletes the RS1 node information saved in the local
> region). The ServerCrashProcedure then generates a
> TransitRegionStateProcedure subprocedure that plans to assign the region to
> RS2 (RS2 has not been stopped at this time). Then RS2 is stopped, and the
> Master is also stopped.
> When a cluster is rebuilt from the existing root directory (from the old
> cluster on OSS), the Master recovers the procedure state and continues to
> execute the unfinished procedures. At this point there is an uncompleted
> TransitRegionStateProcedure whose target RS for the region is RS2 of the
> old cluster, and the Master logs a warning:
> {code:java}
> 2022-04-09 14:44:00,050 INFO [master/emr-header-1:16000:becomeActiveMaster]
> procedure.MasterProcedureScheduler: Took xlock for pid=33950, ppid=33615,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
> TransitRegionStateProcedure table=usertable,
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN
> 2022-04-09 14:44:00,241 INFO [master/emr-header-1:16000:becomeActiveMaster]
> assignment.AssignmentManager: Attach pid=33950, ppid=33615,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false;
> TransitRegionStateProcedure table=usertable,
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN to state=OFFLINE,
> location=null, table=usertable, region=5c0dac3f7410e8c91e00bcfcdc6e774a to
> restore RIT
> 2022-04-09 14:44:11,678 WARN [PEWorker-1]
> assignment.RegionRemoteProcedureBase: Can not add remote operation pid=35353,
> ppid=33950, state=RUNNABLE, hasLock=true; OpenRegionProcedure
> 5c0dac3f7410e8c91e00bcfcdc6e774a,
> server=emr-worker-1.cluster-18871,16020,1649346155842 for region
> {ENCODED => 5c0dac3f7410e8c91e00bcfcdc6e774a, NAME =>
> 'usertable,user1004,1649256577246.5c0dac3f7410e8c91e00bcfcdc6e774a.',
> STARTKEY => 'user1004', ENDKEY => 'user1008'} to server
> emr-worker-1.cluster-18871,16020,1649346155842, this usually because the
> server is alread dead, give up and mark the procedure as complete, the parent
> procedure will take care of this.
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException:
> emr-worker-1.cluster-18871,16020,1649346155842; pid=35353, ppid=33950,
> state=RUNNABLE, hasLock=true; OpenRegionProcedure
> 5c0dac3f7410e8c91e00bcfcdc6e774a,
> server=emr-worker-1.cluster-18871,16020,1649346155842
> at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:172)
> at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:283)
> at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:56)
> at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
>
> 2022-04-09 14:44:11,689 INFO [PEWorker-1] procedure2.ProcedureExecutor:
> Finished pid=35353, ppid=33950, state=SUCCESS, hasLock=false;
> OpenRegionProcedure 5c0dac3f7410e8c91e00bcfcdc6e774a,
> server=emr-worker-1.cluster-18871,16020,1649346155842 in 2 mins, 37.672 sec
> 2022-04-09 14:44:11,697 INFO [PEWorker-1] procedure2.ProcedureExecutor:
> Finished pid=33950, ppid=33615, state=SUCCESS, hasLock=false;
> TransitRegionStateProcedure table=usertable,
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN in 2 mins, 38.865 sec{code}
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)