[jira] [Commented] (HBASE-26914) Rebuild a cluster from an existing root directory may hang

LiangJun He (Jira) Mon, 11 Apr 2022 09:07:08 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520677#comment-17520677
 ]


LiangJun He commented on HBASE-26914:
-------------------------------------

In repeated tests, I found the following problems:
1. When the old cluster stops RS1 first, the Master expires RS1 and deletes its 
record from the master local region. Even if the SCP has not finished (it may 
still be executing TRSPs), the new cluster resumes those TRSPs on startup. They 
fail because their target RSs all belong to the old cluster. Since the master 
local region no longer has a record of the old RS1, no SCP can be triggered to 
reassign the regions that were on the old RS1 when the new cluster starts.

2. In some tests, region1 had been successfully reassigned from RS1 to RS2 on 
the old cluster and the new state had been written to the hbase:meta table, 
yet when the new cluster came up, the location recorded for region1 in 
hbase:meta was still RS1. At that point no SCP can be triggered for RS1 to 
recover the region.

3. In some abnormal cases, the WALs of the master:store table and the 
hbase:meta table are on the HDFS cluster of the old cluster, so the new cluster 
cannot read them to restore the normal state. In that case the new cluster may 
not be able to reassign regions through SCP.

The second problem above may be specific to my environment, but I could not 
find a corresponding error log.

In the end, I manually triggered SCPs for the unknown servers by calling the 
scheduleSCPsForUnknownServers() interface through RPC, which solved the 
problem.
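
As a rough illustration of what "unknown server" means here, the sketch below 
models the idea in plain Java. It is not HBase code and does not call any HBase 
API: the class name, the method name, and the new-cluster server name are all 
hypothetical, and it assumes that an unknown server is one that hbase:meta 
still references but that is neither live nor tracked as dead by the Master. 
The old server name and the region name are taken from the log in the issue 
description.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model, not part of HBase: shows which servers would need an
// SCP scheduled for them under the assumption stated above.
public class UnknownServerSketch {

    /**
     * Returns the servers that appear as region locations in the modelled
     * meta table but are in neither the live set nor the dead set.
     */
    public static Set<String> findUnknownServers(Map<String, String> regionLocations,
                                                 Set<String> liveServers,
                                                 Set<String> deadServers) {
        Set<String> unknown = new TreeSet<>();
        for (String server : regionLocations.values()) {
            if (!liveServers.contains(server) && !deadServers.contains(server)) {
                unknown.add(server);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Old-cluster server name as it appears in the log of this issue.
        String oldRs = "emr-worker-1.cluster-18871,16020,1649346155842";
        // Hypothetical server name for the rebuilt cluster.
        String newRs = "emr-worker-1.cluster-20001,16020,1649500000000";

        // Modelled hbase:meta: the region is still recorded on the old server.
        Map<String, String> meta = Map.of("5c0dac3f7410e8c91e00bcfcdc6e774a", oldRs);

        // The rebuilt Master sees only the new server and has no record of
        // the old one, so the old server is neither live nor dead.
        Set<String> unknown = findUnknownServers(meta, Set.of(newRs), Set.of());
        System.out.println("Servers needing an SCP: " + unknown);
    }
}
```

In this model the old server is reported as unknown because the rebuilt Master 
has no record of it at all, which matches problem 1 above: nothing triggers an 
SCP for it until one is scheduled explicitly.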

Therefore, can we define a command in the hbase shell to trigger the 
scheduleSCPsForUnknownServers() interface? In older versions this interface 
was triggered through HBCK2, but the latest version of HBase does not support 
it.

[~zhangduo] 

> Rebuild a cluster from an existing root directory may hang
> ----------------------------------------------------------
>
>                 Key: HBASE-26914
>                 URL: https://issues.apache.org/jira/browse/HBASE-26914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: LiangJun He
>            Assignee: LiangJun He
>            Priority: Major
>             Fix For: 3.0.0-alpha-2
>
>
> After HBASE-26245, we can rebuild a cluster from an existing root directory 
> in a cloud environment, but one problem remains: if we stop the 
> regionservers of the old cluster first and then stop the masters, the new 
> cluster (rebuilt from the old cluster's root directory) may hang.
> This problem is also described in HBASE-26898.
>  
> Hang Reason:
> For example, if RS1 is stopped first, the Master expires RS1 and generates a 
> ServerCrashProcedure to reassign the regions on RS1 (at the same time, the 
> Master deletes the RS1 node information saved in the local region). The SCP 
> then generates a TransitRegionStateProcedure sub-procedure that plans to 
> assign the region to RS2 (RS2 has not been stopped at this time). Then RS2 is 
> stopped, and the Master is stopped as well.
> When a cluster is rebuilt from the existing root directory (from the old 
> cluster, on OSS), the Master recovers the procedure state and continues to 
> execute the unfinished procedures. At this point there is an uncompleted 
> TransitRegionStateProcedure whose target RS for the region is RS2 of the 
> old cluster, so it reports a warning message:
> {code:java}
> 2022-04-09 14:44:00,050 INFO  [master/emr-header-1:16000:becomeActiveMaster] 
> procedure.MasterProcedureScheduler: Took xlock for pid=33950, ppid=33615, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
> TransitRegionStateProcedure table=usertable, 
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN
> 2022-04-09 14:44:00,241 INFO  [master/emr-header-1:16000:becomeActiveMaster] 
> assignment.AssignmentManager: Attach pid=33950, ppid=33615, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
> TransitRegionStateProcedure table=usertable, 
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN to state=OFFLINE, 
> location=null, table=usertable, region=5c0dac3f7410e8c91e00bcfcdc6e774a to 
> restore RIT
> 2022-04-09 14:44:11,678 WARN  [PEWorker-1] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation pid=35353, 
> ppid=33950, state=RUNNABLE, hasLock=true; OpenRegionProcedure 
> 5c0dac3f7410e8c91e00bcfcdc6e774a, 
> server=emr-worker-1.cluster-18871,16020,1649346155842 for region 
> {ENCODED => 5c0dac3f7410e8c91e00bcfcdc6e774a, NAME => 
> 'usertable,user1004,1649256577246.5c0dac3f7410e8c91e00bcfcdc6e774a.', 
> STARTKEY => 'user1004', ENDKEY => 'user1008'} to server 
> emr-worker-1.cluster-18871,16020,1649346155842, this usually because the 
> server is alread dead, give up and mark the procedure as complete, the parent 
> procedure will take care of this.
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> emr-worker-1.cluster-18871,16020,1649346155842; pid=35353, ppid=33950, 
> state=RUNNABLE, hasLock=true; OpenRegionProcedure 
> 5c0dac3f7410e8c91e00bcfcdc6e774a, 
> server=emr-worker-1.cluster-18871,16020,1649346155842
>         at 
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:172)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:283)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:56)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
>  
> 2022-04-09 14:44:11,689 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
> Finished pid=35353, ppid=33950, state=SUCCESS, hasLock=false; 
> OpenRegionProcedure 5c0dac3f7410e8c91e00bcfcdc6e774a, 
> server=emr-worker-1.cluster-18871,16020,1649346155842 in 2 mins, 37.672 sec
> 2022-04-09 14:44:11,697 INFO  [PEWorker-1] procedure2.ProcedureExecutor: 
> Finished pid=33950, ppid=33615, state=SUCCESS, hasLock=false; 
> TransitRegionStateProcedure table=usertable, 
> region=5c0dac3f7410e8c91e00bcfcdc6e774a, ASSIGN in 2 mins, 38.865 sec{code}


