[ 
https://issues.apache.org/jira/browse/HBASE-29259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943805#comment-17943805
 ] 

Duo Zhang commented on HBASE-29259:
-----------------------------------

OK, there is a possible race that could cause this problem.

This is the TRSP in trouble:

{noformat}
2025-04-12T09:36:35,150 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: 
Took xlock for pid=411426, ppid=411395, 
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE, hasLock=false; 
TransitRegionStateProcedure table=IntegrationTestBigLinkedList, 
region=a13d6f17eba604f7e37d981aefc62212, REOPEN/MOVE
2025-04-12T09:36:35,151 INFO  [PEWorker-3] assignment.RegionStateStore: 
pid=411426 updating hbase:meta row=a13d6f17eba604f7e37d981aefc62212, 
regionState=CLOSING, regionLocation=data04,16020,1744450050895
{noformat}

And since hbase:meta is also on data04,16020,1744450050895, the update 
operation hangs until the SCP finishes assigning hbase:meta.

So the SCP and the TRSP are executed concurrently. The TRSP created the 
remoteProc, which is a CloseRegionProcedure, and then released the region lock, 
but before we recorded the CloseRegionProcedure in ProcedureExecutor as a sub 
procedure, the SCP called remoteProc.serverCrash and persisted the remoteProc 
to the procedure store while it was still in INITIALIZING state.
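
To make the interleaving concrete, here is a minimal single-threaded replay of 
it. This is a toy model, not HBase code: RaceSketch, Proc, RegionNode, 
registerAsChild, checkLoadable and the pid value 2 are all made up for 
illustration, and the real classes are far more involved. It only shows why 
persisting the remoteProc before it is recorded as a sub procedure leaves a 
pid=-1, INITIALIZING row in the store.

```java
import java.util.ArrayList;
import java.util.List;

class RaceSketch {
  enum State { INITIALIZING, RUNNABLE }

  // Hypothetical stand-in for a procedure such as CloseRegionProcedure.
  static final class Proc {
    final String name;
    long pid = -1;                     // no pid until it is registered
    State state = State.INITIALIZING;  // initial state of a new procedure
    Proc(String name) { this.name = name; }
    @Override public String toString() {
      return "pid=" + pid + ", state=" + state + "; " + name;
    }
  }

  // Hypothetical stand-in for the region node that exposes remoteProc.
  static final class RegionNode { Proc remoteProc; }

  // Models recording the procedure as a sub procedure (assumed behaviour:
  // it gets a pid and leaves INITIALIZING).
  static void registerAsChild(Proc child, long pid) {
    child.pid = pid;
    child.state = State.RUNNABLE;
  }

  // Models the load-time check that rejects INITIALIZING procedures.
  static void checkLoadable(List<String> store) {
    for (String row : store) {
      if (row.contains("state=INITIALIZING")) {
        throw new UnsupportedOperationException(
            "Unexpected INITIALIZING state for " + row);
      }
    }
  }

  // Replays the two orderings and returns what ends up in the "store".
  static List<String> replay(boolean scpRunsBeforeRegistration) {
    List<String> store = new ArrayList<>();
    RegionNode node = new RegionNode();

    // TRSP: create the remoteProc and publish it on the region node,
    // then (conceptually) release the region node lock.
    Proc close = new Proc("CloseRegionProcedure");
    node.remoteProc = close;

    if (scpRunsBeforeRegistration) {
      store.add(node.remoteProc.toString()); // SCP persists it too early
      registerAsChild(close, 2L);            // registration comes too late
    } else {
      registerAsChild(close, 2L);            // registered first
      store.add(node.remoteProc.toString()); // SCP persists a valid row
    }
    return store;
  }

  public static void main(String[] args) {
    System.out.println(replay(true));
    System.out.println(replay(false));
  }
}
```

In this model, replay(true) stores a row that checkLoadable rejects, which 
matches the shape of the exception in the issue description, while 
replay(false) stores a row that loads fine.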

Let me think about how to fix this properly. The correct way to lock is to wait 
until the TRSP releases the execution lock too, but since we already hold the 
region node lock in the SCP, acquiring the execution lock inside it may cause a 
deadlock...
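
The deadlock concern is the usual lock ordering problem. The sketch below is 
only an illustration with two plain ReentrantLocks standing in for the region 
node lock and the execution lock. It assumes, hypothetically, that the TRSP 
would try to take the region node lock again while still holding its execution 
lock. LockOrderSketch and holdThenTry are invented names, and tryLock with a 
timeout is used so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderSketch {
  // Holds `first`, then tries to take `second` with a timeout.
  // Returns true only if `second` could be acquired.
  static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
      CountDownLatch bothHoldFirst, CountDownLatch bothTried)
      throws InterruptedException {
    first.lock();
    try {
      bothHoldFirst.countDown();
      bothHoldFirst.await();   // wait until the other side holds its lock
      boolean got = second.tryLock(100, TimeUnit.MILLISECONDS);
      if (got) {
        second.unlock();
      }
      bothTried.countDown();
      bothTried.await();       // keep `first` until the other side has tried
      return got;
    } finally {
      first.unlock();
    }
  }

  // Index 0: "SCP" (region node lock, then execution lock).
  // Index 1: "TRSP" (execution lock, then region node lock).
  static boolean[] run() throws InterruptedException {
    ReentrantLock regionNodeLock = new ReentrantLock();
    ReentrantLock executionLock = new ReentrantLock();
    CountDownLatch bothHoldFirst = new CountDownLatch(2);
    CountDownLatch bothTried = new CountDownLatch(2);
    boolean[] acquired = new boolean[2];

    Thread scp = new Thread(() -> {
      try {
        acquired[0] = holdThenTry(regionNodeLock, executionLock,
            bothHoldFirst, bothTried);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Thread trsp = new Thread(() -> {
      try {
        acquired[1] = holdThenTry(executionLock, regionNodeLock,
            bothHoldFirst, bothTried);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    scp.start();
    trsp.start();
    scp.join();
    trsp.join();
    return acquired;
  }

  public static void main(String[] args) throws InterruptedException {
    boolean[] r = run();
    System.out.println("SCP got execution lock: " + r[0]);
    System.out.println("TRSP got region node lock: " + r[1]);
  }
}
```

Neither side gets its second lock here. With blocking lock() calls instead of 
tryLock, both threads would wait on each other forever.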

> Master crash when loading procedures
> ------------------------------------
>
>                 Key: HBASE-29259
>                 URL: https://issues.apache.org/jira/browse/HBASE-29259
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>
> Hit this error when running ITBLL
> {noformat}
> 2025-04-12T10:32:50,541 ERROR [master/meta01:16000:becomeActiveMaster] 
> master.HMaster: Failed to become active master
> java.lang.UnsupportedOperationException: Unexpected INITIALIZING state for 
> pid=-1, state=INITIALIZING, hasLock=false; CloseRegionProcedure 
> a13d6f17eba604f7e37d981aefc62212, server=data04,16020,1744450050895
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.initializeStacks(ProcedureExecutor.java:453)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:593)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$1.load(ProcedureExecutor.java:344)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:287)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:335)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:688)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1875)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1030)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2554)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:624) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.trace.TraceUtil.lambda$tracedRunnable$2(TraceUtil.java:155)
>  ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at java.lang.Thread.run(Thread.java:840) ~[?:?]
> 2025-04-12T10:32:50,547 ERROR [master/meta01:16000:becomeActiveMaster] 
> master.HMaster: ***** ABORTING master meta01,16000,1744453967314: Unhandled 
> exception. Starting shutdown. *****
> java.lang.UnsupportedOperationException: Unexpected INITIALIZING state for 
> pid=-1, state=INITIALIZING, hasLock=false; CloseRegionProcedure 
> a13d6f17eba604f7e37d981aefc62212, server=data04,16020,1744450050895
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.initializeStacks(ProcedureExecutor.java:453)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:593)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$1.load(ProcedureExecutor.java:344)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:287)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:335)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:688)
>  ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1875)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1030)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2554)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:624) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.trace.TraceUtil.lambda$tracedRunnable$2(TraceUtil.java:155)
>  ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at java.lang.Thread.run(Thread.java:840) ~[?:?]
> {noformat}
> Need to dig more on why this could happen.



