[jira] [Commented] (HBASE-26914) Rebuild a cluster from an existing root directory may hang

Duo Zhang (Jira) Sat, 09 Apr 2022 01:14:04 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519926#comment-17519926
 ]


Duo Zhang commented on HBASE-26914:
-----------------------------------

This behavior is expected. The TRSP will try to assign the region again. 
Later, after we schedule an SCP, the SCP will interrupt this TRSP and let it 
assign the region to another region server.
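
As a rough illustration of the sequence described above, here is a toy model, 
not HBase code. Every class, method, and server name in it (ToyAssign, 
handleServerCrash, old-rs-2, new-rs-1) is hypothetical. It only assumes what 
the comment states: the assign attempt keeps failing while its target is a 
dead server, and crash handling later redirects it to a live server.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: not HBase code. All names here are hypothetical. */
public class AssignRetrySketch {

    /** Stand-in for a region assignment procedure (the TRSP in this thread). */
    static final class ToyAssign {
        final String region;
        String target;          // server the procedure is currently trying
        int attempts = 0;
        boolean opened = false;

        ToyAssign(String region, String target) {
            this.region = region;
            this.target = target;
        }

        /** One attempt: succeeds only if the target is a live server. */
        void tryOpen(Set<String> liveServers) {
            attempts++;
            opened = liveServers.contains(target);
        }
    }

    /** Stand-in for crash handling (the SCP): redirect procedures stuck on a dead server. */
    static void handleServerCrash(String deadServer, List<ToyAssign> procs,
                                  Set<String> liveServers) {
        for (ToyAssign p : procs) {
            if (!p.opened && deadServer.equals(p.target) && !liveServers.isEmpty()) {
                p.target = liveServers.iterator().next(); // pick any live server
            }
        }
    }

    /** Replays the scenario: two failed attempts, crash handling, then success. */
    static ToyAssign simulate() {
        Set<String> live = new LinkedHashSet<>();
        live.add("new-rs-1");                       // only the new cluster's server is alive
        ToyAssign trsp = new ToyAssign("region-A", "old-rs-2");
        List<ToyAssign> procs = new ArrayList<>();
        procs.add(trsp);

        trsp.tryOpen(live);                         // fails: old-rs-2 is gone
        trsp.tryOpen(live);                         // retry fails for the same reason
        handleServerCrash("old-rs-2", procs, live); // crash handling picks a new target
        trsp.tryOpen(live);                         // succeeds on new-rs-1
        return trsp;
    }

    public static void main(String[] args) {
        ToyAssign p = simulate();
        System.out.println(p.region + " opened=" + p.opened + " on " + p.target
            + " after " + p.attempts + " attempts");
    }
}
```

In this model the assignment makes no progress until crash handling runs for 
the old server, so a rebuilt cluster that never runs it for old-rs-2 would keep 
retrying.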

> Rebuild a cluster from an existing root directory may hang
> ----------------------------------------------------------
>
>                 Key: HBASE-26914
>                 URL: https://issues.apache.org/jira/browse/HBASE-26914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: LiangJun He
>            Assignee: LiangJun He
>            Priority: Major
>             Fix For: 3.0.0-alpha-2
>
>
> After HBASE-26245, we can rebuild a cluster from an existing root directory 
> in a cloud environment, but a problem remains: if we stop the 
> regionservers of the old cluster first and then stop the masters, the new 
> cluster (rebuilt from the old cluster's root directory) may hang.
> This problem is also described in HBASE-26898.
>  
> Hang Reason:
> For example, if RS1 is stopped first, the Master expires RS1 and 
> generates a ServerCrashProcedure to reassign the region on RS1 (at the same 
> time, the Master deletes the RS1 node information saved in the local 
> region). The ServerCrashProcedure then generates a 
> TransitRegionStateProcedure subprocedure that plans to assign the region to 
> RS2 (RS2 has not been stopped at this time). Then RS2 is stopped, and the 
> Master is also stopped.
> When a cluster is rebuilt from the existing root directory (from the old 
> cluster on OSS), the Master recovers the procedure state and continues to 
> execute the unfinished procedures. At this point there is an uncompleted 
> TransitRegionStateProcedure whose target region server is RS2 of the old 
> cluster, and it reports this warning:
> {code:java}
> 2022-04-09 14:44:00,050 INFO  [master/emr-header-1:16000:becomeActiveMaster] 
> procedure.MasterProcedureScheduler: Took xlock for pid=33950, ppid=33615, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
> TransitRegionStateProcedure table=usertable, 
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN
> 2022-04-09 14:44:00,241 INFO  [master/emr-header-1:16000:becomeActiveMaster] 
> assignment.AssignmentManager: Attach pid=33950, ppid=33615, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
> TransitRegionStateProcedure table=usertable, 
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN to state=OFFLINE, 
> location=null, table=usertable, region=5c0dac3f7410e8c91e00bcfcdc6e774a to 
> restore RIT
> 2022-04-09 14:44:11,678 WARN  [PEWorker-1] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation pid=35353, 
> ppid=33950, state=RUNNABLE, hasLock=true; OpenRegionProcedure 
> 5c0dac3f7410e8c91e00bcfcdc6e774a, 
> server=emr-worker-1.cluster-18871,16020,1649346155842 for region 
> {ENCODED => 5c0dac3f7410e8c91e00bcfcdc6e774a, NAME => 
> 'usertable,user1004,1649256577246.5c0dac3f7410e8c91e00bcfcdc6e774a.', 
> STARTKEY => 'user1004', ENDKEY => 'user1008'} to server 
> emr-worker-1.cluster-18871,16020,1649346155842, this usually because the 
> server is alread dead, give up and mark the procedure as complete, the parent 
> procedure will take care of this.
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> emr-worker-1.cluster-18871,16020,1649346155842; pid=35353, ppid=33950, 
> state=RUNNABLE, hasLock=true; OpenRegionProcedure 
> 5c0dac3f7410e8c91e00bcfcdc6e774a, 
> server=emr-worker-1.cluster-18871,16020,1649346155842
>         at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:172)
>         at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:283)
>         at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:56)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
>  
> 2022-04-09 14:44:11,689 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
> Finished pid=35353, ppid=33950, state=SUCCESS, hasLock=false; 
> OpenRegionProcedure 5c0dac3f7410e8c91e00bcfcdc6e774a, 
> server=emr-worker-1.cluster-18871,16020,1649346155842 in 2 mins, 37.672 sec
> 2022-04-09 14:44:11,697 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
> Finished pid=33950, ppid=33615, state=SUCCESS, hasLock=false; 
> TransitRegionStateProcedure table=usertable, 
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN in 2 mins, 38.865 sec{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
