[jira] [Updated] (HBASE-26914) Rebuild a cluster from an existing root directory may hang

LiangJun He (Jira) Sat, 09 Apr 2022 00:14:05 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


LiangJun He updated HBASE-26914:
--------------------------------
    Description: 
After HBASE-26245, we can rebuild a cluster from an existing root directory in 
a cloud environment. However, a problem remains: if we stop the regionservers 
of the old cluster first and then stop the masters, the new cluster (rebuilt 
from the old cluster's root directory) may hang.

This problem is also described in HBASE-26898.

 

Hang Reason:

For example, if RS1 is stopped first, the Master expires RS1 and generates a 
ServerCrashProcedure to reassign the region on RS1 (at the same time, the 
Master deletes the RS1 node information saved in the local region). The 
ServerCrashProcedure then generates a TransitRegionStateProcedure subprocedure 
that plans to assign the region to RS2, which has not been stopped at this 
point. RS2 is then stopped, and the Master is stopped as well.

When a cluster is rebuilt from an existing root directory (that of the old 
cluster, on OSS), the Master recovers the Procedure state and continues to 
execute the unfinished Procedures. At this point there is an uncompleted 
TransitRegionStateProcedure whose target RS for the region is RS2 of the old 
cluster, and the Master reports the following warning message:
{code:java}
2022-04-09 14:44:00,050 INFO  [master/emr-header-1:16000:becomeActiveMaster] 
procedure.MasterProcedureScheduler: Took xlock for pid=33950, ppid=33615, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=usertable, 
region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN
2022-04-09 14:44:00,241 INFO  [master/emr-header-1:16000:becomeActiveMaster] 
assignment.AssignmentManager: Attach pid=33950, ppid=33615, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=usertable, 
region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN to state=OFFLINE, 
location=null, table=usertable, region=5c0dac3f7410e8c91e00bcfcdc6e774a to 
restore RIT
2022-04-09 14:44:11,678 WARN  [PEWorker-1] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=35353, 
ppid=33950, state=RUNNABLE, hasLock=true; OpenRegionProcedure 
5c0dac3f7410e8c91e00bcfcdc6e774a, 
server=emr-worker-1.cluster-18871,16020,1649346155842 for region 
{ENCODED => 5c0dac3f7410e8c91e00bcfcdc6e774a, NAME => 
'usertable,user1004,1649256577246.5c0dac3f7410e8c91e00bcfcdc6e774a.', 
STARTKEY => 'user1004', ENDKEY => 'user1008'} to server 
emr-worker-1.cluster-18871,16020,1649346155842, this usually because the server 
is alread dead, give up and mark the procedure as complete, the parent 
procedure will take care of this.
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
emr-worker-1.cluster-18871,16020,1649346155842; pid=35353, ppid=33950, 
state=RUNNABLE, hasLock=true; OpenRegionProcedure 
5c0dac3f7410e8c91e00bcfcdc6e774a, 
server=emr-worker-1.cluster-18871,16020,1649346155842
        at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:172)
        at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:283)
        at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:56)
        at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
 
2022-04-09 14:44:11,689 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
Finished pid=35353, ppid=33950, state=SUCCESS, hasLock=false; 
OpenRegionProcedure 5c0dac3f7410e8c91e00bcfcdc6e774a, 
server=emr-worker-1.cluster-18871,16020,1649346155842 in 2 mins, 37.672 sec
2022-04-09 14:44:11,697 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
Finished pid=33950, ppid=33615, state=SUCCESS, hasLock=false; 
TransitRegionStateProcedure table=usertable, 
region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN in 2 mins, 38.865 sec{code}
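
The mismatch behind this warning can be illustrated with a small standalone sketch. It is not HBase code and does not use the HBase API: the class StaleTargetCheck, the nested ServerId type, and the online server name of the rebuilt cluster are all hypothetical and exist only for illustration. The sketch assumes the server name has the form host,port,startcode, as it appears in the log above, and shows that a target recorded by the old cluster is not among the servers that are online in the new cluster.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class StaleTargetCheck {

    /** Hypothetical stand-in for a server identity written as host,port,startcode. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** Parses text of the form host,port,startcode. */
        static ServerId parse(String text) {
            String[] parts = text.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected host,port,startcode: " + text);
            }
            return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof ServerId)) {
                return false;
            }
            ServerId other = (ServerId) o;
            return host.equals(other.host) && port == other.port && startCode == other.startCode;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port, startCode);
        }
    }

    /** Returns true when the recorded target is not one of the currently online servers. */
    static boolean isStaleTarget(ServerId target, Set<ServerId> onlineServers) {
        return !onlineServers.contains(target);
    }

    public static void main(String[] args) {
        // Target server recorded in the recovered procedure (taken from the log above).
        ServerId recordedTarget = ServerId.parse("emr-worker-1.cluster-18871,16020,1649346155842");

        // Made-up server name standing in for a regionserver of the rebuilt cluster.
        Set<ServerId> online = new HashSet<>();
        online.add(ServerId.parse("emr-worker-1.cluster-20001,16020,1649486400000"));

        // Prints "stale target: true": the old cluster's RS2 is unknown to the new cluster.
        System.out.println("stale target: " + isStaleTarget(recordedTarget, online));
    }
}
```

A check of this kind only detects the stale target. How the Master should react to it is outside the scope of this sketch.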
  was:
After HBASE-26245, we can rebuild a cluster from an existing root directory in 
a cloud environment. However, a problem remains: if we stop the regionservers 
of the old cluster first and then stop the masters, the new cluster (rebuilt 
from the old cluster's root directory) may hang.

This problem is also described in HBASE-26898.

 

Hang Reason:

For example, if RS1 is stopped first, the Master expires RS1 and generates a 
ServerCrashProcedure to reassign the region on RS1 (at the same time, the 
Master deletes the RS1 node information saved in the local region). The 
ServerCrashProcedure then generates a TransitRegionStateProcedure subprocedure 
that plans to assign the region to RS2, which has not been stopped at this 
point. RS2 is then stopped, and the Master is stopped as well.

When a cluster is rebuilt from an existing root directory (that of the old 
cluster, on OSS), the Master recovers the Procedure state and continues to 
execute the unfinished Procedures. At this point there is an uncompleted 
TransitRegionStateProcedure whose target RS for the region is RS2 of the old 
cluster, and the Master reports the following warning message:

 
{code:java}
2022-04-09 14:44:11,678 WARN  [PEWorker-1] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=35353, 
ppid=33950, state=RUNNABLE, hasLock=true; OpenRegionProcedure 
5c0dac3f7410e8c91e00bcfcdc6e774a, 
server=emr-worker-1.cluster-18871,16020,1649346155842 for region 
{ENCODED => 5c0dac3f7410e8c91e00bcfcdc6e774a, NAME => 
'usertable,user1004,1649256577246.5c0dac3f7410e8c91e00bcfcdc6e774a.', 
STARTKEY => 'user1004', ENDKEY => 'user1008'} to server 
emr-worker-1.cluster-18871,16020,1649346155842, this usually because the server 
is alread dead, give up and mark the procedure as complete, the parent 
procedure will take care of this.
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
emr-worker-1.cluster-18871,16020,1649346155842; pid=35353, ppid=33950, 
state=RUNNABLE, hasLock=true; OpenRegionProcedure 
5c0dac3f7410e8c91e00bcfcdc6e774a, 
server=emr-worker-1.cluster-18871,16020,1649346155842
        at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:172)
        at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:283)
        at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:56)
        at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
{code}


> Rebuild a cluster from an existing root directory may hang
> ----------------------------------------------------------
>
>                 Key: HBASE-26914
>                 URL: https://issues.apache.org/jira/browse/HBASE-26914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: LiangJun He
>            Assignee: LiangJun He
>            Priority: Major
>             Fix For: 3.0.0-alpha-2
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
