[ 
https://issues.apache.org/jira/browse/HBASE-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956019#comment-17956019
 ] 

chaijunjie commented on HBASE-29364:
------------------------------------

Did you test on a newer version? Or could you try backporting 
https://issues.apache.org/jira/browse/HBASE-28180 and testing again?

> Region will be opened in unknown regionserver when master is changed & rs 
> crashed
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-29364
>                 URL: https://issues.apache.org/jira/browse/HBASE-29364
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Zhiwen Deng
>            Priority: Major
>
> We have encountered multiple cases where regions were opened on RegionServers 
> (RS) that had already been offlined. Only recently did we discover a potential 
> cause for this issue; the details are as follows:
> Our HDFS storage reached its capacity threshold, which left the Master and 
> RSes above it unable to write, so they aborted. We manually deleted some data 
> and HDFS recovered. The hbck report then showed that some regions were open on 
> offline RSes, leaving those regions unable to serve. We finally used hbck2 to 
> assign these regions, and the problem was resolved.
> h3. The Problem
> Here is the analysis of the region transition for one specific region, 
> 19f709990ad65ce3d51ddeaf29acf436:
> 2025-05-21, 05:48:11: The region was assigned to 
> {+}rs-hostname,20700,1747777624803{+}, but due to some anomaly it could not 
> be opened on the target RS. The RS then reported the failed open to the 
> Master:
> {code:java}
> 2025-05-21,05:48:11,646 INFO 
> [RpcServer.priority.RWQ.Fifo.write.handler=2,queue=0,port=20600] 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase: Received 
> report from rs-hostname,20700,1747777624803, transitionCode=FAILED_OPEN, 
> seqId=-1, regionNode=state=OPENING, location=rs-hostname,20700,1747777624803, 
> table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436, proc=pid=78499, 
> ppid=78034, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure {code}
> 2025-05-21, 05:48:27: rs-hostname,20700,1747777624803 went offline.
> {code:java}
> 2025-05-21, 05:48:27,981 INFO [KeepAlivePEWorker-65] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=78671, 
> state=SUCCESS; ServerCrashProcedure server=rs-hostname,20700,1747777624803, 
> splitWal=true, meta=false in 16.0720 sec {code}
> 2025-05-21, 05:49:12: The RS hosting the hbase:meta table also encountered a 
> failure, making the meta table unavailable; this caused the above region to 
> get stuck in RIT (Region-In-Transition).
> {code:java}
> 2025-05-21, 05:49:12,312 WARN [ProcExecTimeout] 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPENING, location=rs-hostname,20700,1747777624803, 
> table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 {code}
> Due to the HDFS failure, the Master also performed an abort action. The new 
> active Master continued to execute the previously incomplete procedure.
> {code:java}
> 2025-05-21, 06:02:38,423 INFO 
> [master/master-hostname:20600:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
> for pid=78034, ppid=77973, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
> TransitRegionStateProcedure table=test:xxx, 
> region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN
> 2025-05-21, 06:02:38,572 INFO [master/master-hostname:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach 
> pid=78034, ppid=77973, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
> TransitRegionStateProcedure table=test:xxx, 
> region=19f709990ad65ce3d51ddeaf29acf436, ASSIGN to state=OFFLINE, 
> location=null, table=test:xxx, region=19f709990ad65ce3d51ddeaf29acf436 to 
> restore RIT {code}
> When the Master fails over, the new active Master updates the region's state 
> in the meta table based on the procedure's persisted state:
> {code:java}
> 2025-05-21, 06:07:52,433 INFO 
> [master/master-hostname:20600:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: Load hbase:meta 
> entry region=19f709990ad65ce3d51ddeaf29acf436, regionState=OPENING, 
> lastHost=rs-hostname-last,20700,1747776391310, 
> regionLocation=rs-hostname,20700,1747777624803, openSeqNum=174628702
> 2025-05-21, 06:07:52,433 WARN 
> [master/master-hostname:20600:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure: Received 
> report OPENED transition from rs-hostname,20700,1747777624803 for 
> state=OPENING, location=rs-hostname,20700,1747777624803, table=test:xxx, 
> region=19f709990ad65ce3d51ddeaf29acf436, pid=78499 but the new openSeqNum -1 
> is less than the current one 174628702, ignoring...
>  {code}
> I reviewed the relevant code and found that at this point the region's state 
> in the Master's memory is changed to OPENED, and as RegionRemoteProcedureBase 
> transitions, that state is persisted to the meta table.
> {code:java}
> void stateLoaded(AssignmentManager am, RegionStateNode regionNode) {
>   if (state == RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED) {
>     try {
>       restoreSucceedState(am, regionNode, seqId);
>     } catch (IOException e) {
>       // should not happen as we are just restoring the state
>       throw new AssertionError(e);
>     }
>   }
> }
>
> @Override
> protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
>     long openSeqNum) throws IOException {
>   if (regionNode.getState() == State.OPEN) {
>     // should have already been persisted, ignore
>     return;
>   }
>   regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
> }{code}
> Therefore, a failed open was persisted to the meta table as OPEN, and because 
> the ServerCrashProcedure (SCP) for that RS had already finished, the region 
> would never be processed again.
> {code:java}
> 2025-05-21, 06:07:53,138 INFO [PEWorker-56] 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=78034 
> updating hbase:meta row=19f709990ad65ce3d51ddeaf29acf436, regionState=OPEN, 
> repBarrier=174628702, openSeqNum=174628702, 
> regionLocation=rs-hostname,20700,1747777624803 {code}
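> To make the failure mode concrete, here is a minimal, self-contained sketch 
> (hypothetical names and enums, not actual HBase classes) of how replaying a 
> persisted "report received" state without checking the reported transition 
> code turns a FAILED_OPEN report into an OPEN region on an already-dead server:
> {code:java}
> public class RestoreSketch {
>   enum State { OPENING, OPEN }
>   enum TransitionCode { OPENED, FAILED_OPEN }
>
>   // Current logic: any persisted report is replayed as a successful open,
>   // regardless of what the RS actually reported.
>   static State restoreWithoutGuard(State current) {
>     if (current == State.OPEN) {
>       return current; // already persisted, ignore
>     }
>     return State.OPEN;
>   }
>
>   // Proposed logic: only an OPENED report moves the region to OPEN.
>   static State restoreWithGuard(State current, TransitionCode reported) {
>     if (current == State.OPEN) {
>       return current;
>     }
>     return reported == TransitionCode.OPENED ? State.OPEN : current;
>   }
>
>   public static void main(String[] args) {
>     // The RS reported FAILED_OPEN before the Master failover.
>     System.out.println("without guard: "
>       + restoreWithoutGuard(State.OPENING));                            // OPEN
>     System.out.println("with guard: "
>       + restoreWithGuard(State.OPENING, TransitionCode.FAILED_OPEN));  // OPENING
>   }
> }{code}
> With the guard, a region whose last report was FAILED_OPEN stays OPENING and 
> remains eligible for reassignment, instead of being recorded as OPEN on the 
> crashed RS.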
> h3. How to fix
> We can follow the same logic as the non-failover path, where a failed open 
> does not modify the region's state, which prevents the above process from 
> occurring:
> {code:java}
> @Override
> protected void updateTransitionWithoutPersistingToMeta(MasterProcedureEnv env,
>     RegionStateNode regionNode, TransitionCode transitionCode, long openSeqNum)
>     throws IOException {
>   if (transitionCode == TransitionCode.OPENED) {
>     regionOpenedWithoutPersistingToMeta(env.getAssignmentManager(), regionNode,
>       transitionCode, openSeqNum);
>   } else {
>     assert transitionCode == TransitionCode.FAILED_OPEN;
>     // will not persist to meta if giveUp is false
>     env.getAssignmentManager().regionFailedOpen(regionNode, false);
>   }
> }{code}
>  
> So, we only need to add a check on the reported transition code in 
> restoreSucceedState:
> {code:java}
> @Override
> protected void restoreSucceedState(AssignmentManager am, RegionStateNode regionNode,
>     long openSeqNum) throws IOException {
>   if (regionNode.getState() == State.OPEN) {
>     // should have already been persisted, ignore
>     return;
>   }
>   // only an OPENED report should move the region to OPEN; if the report was
>   // not OPENED, do not change the region state, otherwise the region may be
>   // recorded as open on an expired region server
>   if (super.transitionCode == TransitionCode.OPENED) {
>     regionOpenedWithoutPersistingToMeta(am, regionNode, TransitionCode.OPENED, openSeqNum);
>   }
> }{code}
> I want to know whether my fix is OK; if so, I will submit a patch to fix it.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
