[ 
https://issues.apache.org/jira/browse/HBASE-27277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724662#comment-17724662
 ] 

Duo Zhang commented on HBASE-27277:
-----------------------------------

In a normal run, we see this:

{noformat}
2023-04-07T13:52:53,904 DEBUG [RegionServerTracker-0] 
assignment.RegionRemoteProcedureBase(122): pid=10, ppid=7, state=RUNNABLE, 
hasLock=false; OpenRegionProcedure dd97971f0a037756d8b0365e8a42cda8, 
server=c17ceca693c4,45433,1680875568871 for region state=OPENING, 
location=c17ceca693c4,45433,1680875568871, table=Race, 
region=dd97971f0a037756d8b0365e8a42cda8, targetServer 
c17ceca693c4,45433,1680875568871 is dead, SCP will interrupt us, give up
{noformat}

So the difference is this: in a normal run we fail while dispatching the ORP, 
and the TRSP schedules a new ORP. In the failed run we fail after dispatching, 
so we expect the SCP to interrupt us. But the UT expects the TRSP to finish 
while the SCP is hung, and it hangs the SCP intentionally, so we get a 
deadlock...
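
The circular wait described above can be modeled with two latches. This is only a minimal sketch, not HBase code: the class name, method name, and latch names are hypothetical, and each simulated procedure uses a timed wait so the program terminates instead of hanging like the real test.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the cycle; none of these names exist in HBase.
public class ScpTrspDeadlockSketch {

  /** Returns true if both simulated procedures are still blocked after timeoutMs. */
  static boolean bothStuck(long timeoutMs) throws Exception {
    // Counted down only after the TRSP finishes (what the UT expects).
    CountDownLatch resumeScp = new CountDownLatch(1);
    // Counted down only when the SCP interrupts the already dispatched ORP.
    CountDownLatch scpInterruptedOrp = new CountDownLatch(1);

    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      // Simulated SCP: held by the test until the TRSP is done.
      Future<Boolean> scp = pool.submit(() -> {
        boolean resumed = resumeScp.await(timeoutMs, TimeUnit.MILLISECONDS);
        if (resumed) {
          scpInterruptedOrp.countDown();
        }
        return resumed;
      });
      // Simulated TRSP: the ORP was dispatched, so it waits for the SCP.
      Future<Boolean> trsp = pool.submit(() -> {
        boolean interrupted = scpInterruptedOrp.await(timeoutMs, TimeUnit.MILLISECONDS);
        if (interrupted) {
          resumeScp.countDown();
        }
        return interrupted;
      });
      // Neither side can make progress: each waits on the other's latch.
      return !scp.get() && !trsp.get();
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("both stuck: " + bothStuck(200));
  }
}
```

In the normal run the ORP fails at dispatch time, so the simulated TRSP would never wait on the second latch and the cycle would not form.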

FWIW, we only resume the TRSP when we kill the region server, so I do not 
think we want to test the scenario where we send an ORP to a region server and 
the region server dies before returning. Let me see how to make sure that we 
will fall into the deadlock scenario.

Thanks.

> TestRaceBetweenSCPAndTRSP fails in pre commit
> ---------------------------------------------
>
>                 Key: HBASE-27277
>                 URL: https://issues.apache.org/jira/browse/HBASE-27277
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.hbase.master.assignment.TestRaceBetweenSCPAndTRSP-output.txt
>
>
> Seems the PE worker is stuck here. Need to dig more.
> {noformat}
> "PEWorker-5" daemon prio=5 tid=326 in Object.wait()
> java.lang.Thread.State: WAITING (on object monitor)
>         at java.base/jdk.internal.misc.Unsafe.park(Native Method)
>         at 
> java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
>         at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
>         at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
>         at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
>         at 
> java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
>         at 
> app//org.apache.hadoop.hbase.master.assignment.TestRaceBetweenSCPAndTRSP$AssignmentManagerForTest.getRegionsOnServer(TestRaceBetweenSCPAndTRSP.java:97)
>         at 
> app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.getRegionsOnCrashedServer(ServerCrashProcedure.java:288)
>         at 
> app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:195)
>         at 
> app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:66)
>         at 
> app//org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
>         at 
> app//org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:919)
>         at 
> app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>         at 
> app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>         at 
> app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>         at 
> app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>         at 
> app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$477/0x0000000800ac1840.call(Unknown Source)
>         at 
> app//org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>         at 
> app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
