qazwsx created HBASE-29552:
------------------------------

             Summary: RegionRemoteProcedureBase inconsistent state loading caused startup failure.
                 Key: HBASE-29552
                 URL: https://issues.apache.org/jira/browse/HBASE-29552
             Project: HBase
          Issue Type: Bug
            Reporter: qazwsx

Before the power failure, the region 9bf8064aa66e5c6391bcf1d291f5e3fa was being balanced, which triggered a TransitRegionStateProcedure to execute the move. Because part of the data that HDFS held in memory had not been persisted when the power failed, the region state recorded in the META table was OPENING and the procedure record with pid=53510 was lost. After the system was restarted, loadProcedure reloaded the CloseRegionProcedure, and the transitionState call made while restoring its state failed, which ultimately caused the Master service to fail to start.

# log before power off:

2025-08-24 13:53:33,254 | INFO | master/ndp-hbase-master-1:16000.Chore.5 | balance hri=9bf8064aa66e5c6391bcf1d291f5e3fa, source=ndp-hbase-region-0.hbaseregion.sop.svc.cluster.local,16020,1756010506000, destination=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228 | org.apache.hadoop.hbase.master.HMaster.executeRegionPlansWithThrottling(HMaster.java:1987)
2025-08-24 13:53:33,266 | INFO | PEWorker-10 | Initialized subprocedures=[{pid=53505, ppid=53504, state=RUNNABLE; CloseRegionProcedure 9bf8064aa66e5c6391bcf1d291f5e3fa, server=ndp-hbase-region-0.hbaseregion.sop.svc.cluster.local,16020,1756010506000}] | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1685)
2025-08-24 13:53:33,423 | INFO | RSProcedureDispatcher-pool-23 | Using KERBEROS authentication for service=AdminService, sasl=true, type='kerberos' | org.apache.hadoop.hbase.ipc.RpcConnection.<init>(RpcConnection.java:124)
2025-08-24 14:01:50,565 | INFO | PEWorker-15 | pid=53504 updating hbase:meta row=9bf8064aa66e5c6391bcf1d291f5e3fa, regionState=CLOSED | org.apache.hadoop.hbase.master.assignment.RegionStateStore.createPutForRegionLocUpdate(RegionStateStore.java:253)
2025-08-24 14:01:50,569 | INFO | PEWorker-15 | Finished pid=53505, ppid=53504, state=SUCCESS; CloseRegionProcedure 9bf8064aa66e5c6391bcf1d291f5e3fa, server=ndp-hbase-region-0.hbaseregion.sop.svc.cluster.local,16020,1756010506000 in 8 mins, 17.302 sec | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
2025-08-24 14:01:50,569 | INFO | PEWorker-12 | Starting pid=53504, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=student001, region=9bf8064aa66e5c6391bcf1d291f5e3fa, REOPEN/MOVE; state=CLOSED, location=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228; forceNewPlan=false, retain=false | org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.queueAssign(TransitRegionStateProcedure.java:250)
2025-08-24 14:01:50,720 | INFO | PEWorker-18 | pid=53504 updating hbase:meta row=9bf8064aa66e5c6391bcf1d291f5e3fa, regionState=OPENING, regionLocation=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228 | org.apache.hadoop.hbase.master.assignment.RegionStateStore.createPutForRegionLocUpdate(RegionStateStore.java:253)
2025-08-24 14:01:50,726 | INFO | PEWorker-18 | Initialized subprocedures=[{pid=53510, ppid=53504, state=RUNNABLE; OpenRegionProcedure 9bf8064aa66e5c6391bcf1d291f5e3fa, server=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228}] | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1685)
2025-08-24 14:01:51,054 | INFO | PEWorker-5 | pid=53504 updating hbase:meta row=9bf8064aa66e5c6391bcf1d291f5e3fa, regionState=OPEN, openSeqNum=96213, regionLocation=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228 | org.apache.hadoop.hbase.master.assignment.RegionStateStore.createPutForRegionLocUpdate(RegionStateStore.java:253)
2025-08-24 14:01:51,059 | INFO | PEWorker-5 | Finished pid=53510, ppid=53504, state=SUCCESS; OpenRegionProcedure 9bf8064aa66e5c6391bcf1d291f5e3fa, server=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228 in 330 msec | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
2025-08-24 14:01:51,060 | INFO | PEWorker-7 | Finished pid=53504, state=SUCCESS; TransitRegionStateProcedure table=student001, region=9bf8064aa66e5c6391bcf1d291f5e3fa, REOPEN/MOVE in 8 mins, 17.804 sec | org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)

# startup failure after power-failure recovery:

2025-08-24 14:58:19,266 | ERROR | master/ndp-hbase-master-1:16000:becomeActiveMaster | Failed to become active master | org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2393)
java.lang.AssertionError: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected [CLOSING, CLOSED] so could move to CLOSED but current state=OPENING
	at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.stateLoaded(RegionRemoteProcedureBase.java:290)
	at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.stateLoaded(TransitRegionStateProcedure.java:668)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager$RegionMetaLoadingVisitor.visitRegionState(AssignmentManager.java:1879)
	at org.apache.hadoop.hbase.master.assignment.RegionStateStore.visitMetaEntry(RegionStateStore.java:153)
	at org.apache.hadoop.hbase.master.assignment.RegionStateStore.access$100(RegionStateStore.java:66)
	at org.apache.hadoop.hbase.master.assignment.RegionStateStore$1.visit(RegionStateStore.java:95)
	at org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:809)
	at org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:755)
	at org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:716)
	at org.apache.hadoop.hbase.MetaTableAccessor.fullScanRegions(MetaTableAccessor.java:193)
	at org.apache.hadoop.hbase.master.assignment.RegionStateStore.visitMeta(RegionStateStore.java:85)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.loadMeta(AssignmentManager.java:1909)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.joinCluster(AssignmentManager.java:1779)
	at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1035)
	at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2389)
	at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:558)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected [CLOSING, CLOSED] so could move to CLOSED but current state=OPENING
	at org.apache.hadoop.hbase.master.assignment.RegionStateNode.transitionState(RegionStateNode.java:142)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedWithoutPersistingToMeta(AssignmentManager.java:2234)
	at org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure.restoreSucceedState(CloseRegionProcedure.java:116)
	at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.stateLoaded(RegionRemoteProcedureBase.java:287)
	... 16 more

# my question

The entry condition for the {{RegionRemoteProcedureBase#restoreSucceedState}} method is {{RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED}}. Is it possible to skip the expected-state verification when {{regionNode.transitionState}} is executed?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
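
To make the question about {{regionNode.transitionState}} concrete, here is a minimal standalone sketch of the kind of expected-state check that raises the exception in the stack trace. It is an illustration under assumptions, not HBase code: the class TransitionCheckSketch, its Node type, and the method tryTransitionState are hypothetical names, and the State enum is a reduced subset of region states. The strict method fails on an unexpected current state, as in the trace. The lenient method is one possible reading of "skipping the verification": it leaves the loaded state untouched and reports the mismatch to the caller.

```java
import java.util.Arrays;

// Simplified model for illustration only; none of these types exist in HBase.
final class TransitionCheckSketch {

    // Hypothetical, reduced set of region states.
    public enum State { OPENING, OPEN, CLOSING, CLOSED }

    // Stand-in for a region state holder (not HBase's RegionStateNode).
    public static final class Node {
        private State state;

        public Node(State initial) {
            this.state = initial;
        }

        public State getState() {
            return state;
        }

        // Strict transition: throws when the current state is not an expected one.
        public void transitionState(State update, State... expected) {
            if (!Arrays.asList(expected).contains(state)) {
                throw new IllegalStateException("Expected " + Arrays.toString(expected)
                    + " so could move to " + update + " but current state=" + state);
            }
            state = update;
        }

        // Lenient variant (assumption): keeps the current state on a mismatch
        // and returns false so the caller can decide how to recover.
        public boolean tryTransitionState(State update, State... expected) {
            if (!Arrays.asList(expected).contains(state)) {
                return false;
            }
            state = update;
            return true;
        }
    }

    public static void main(String[] args) {
        // State as loaded from meta in the report: OPENING.
        Node node = new Node(State.OPENING);
        try {
            node.transitionState(State.CLOSED, State.CLOSING, State.CLOSED);
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        boolean moved = node.tryTransitionState(State.CLOSED, State.CLOSING, State.CLOSED);
        System.out.println("lenient: moved=" + moved + ", state=" + node.getState());
    }
}
```

Whether tolerating the mismatch is safe in the real Master is exactly the open question of this issue; the sketch only shows where such a decision point would sit.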