[
https://issues.apache.org/jira/browse/HBASE-27277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724662#comment-17724662
]
Duo Zhang commented on HBASE-27277:
-----------------------------------
In a normal run, we see this:
{noformat}
2023-04-07T13:52:53,904 DEBUG [RegionServerTracker-0]
assignment.RegionRemoteProcedureBase(122): pid=10, ppid=7, state=RUNNABLE,
hasLock=false; OpenRegionProcedure dd97971f0a037756d8b0365e8a42cda8,
server=c17ceca693c4,45433,1680875568871 for region state=OPENING,
location=c17ceca693c4,45433,1680875568871, table=Race,
region=dd97971f0a037756d8b0365e8a42cda8, targetServer
c17ceca693c4,45433,1680875568871 is dead, SCP will interrupt us, give up
{noformat}
So the difference is this: in a normal run, we fail while dispatching the ORP,
and the TRSP schedules a new ORP. In the failed run, we fail after dispatching,
so the ORP expects the SCP to interrupt it. But the UT expects the TRSP to
finish while the SCP is hung, and we hang the SCP intentionally, so we get a
deadlock.
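The circular wait described above can be sketched with two latches. This is only an illustrative model, not HBase code: the class LatchDeadlockSketch, the method trspFinishes, and the latch names are all hypothetical, and the timeout exists only so the sketch terminates instead of hanging like the real test.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the race; none of these names exist in HBase.
public class LatchDeadlockSketch {

  /** Returns true if the simulated TRSP finishes within timeoutMs. */
  public static boolean trspFinishes(boolean failWhileDispatching, long timeoutMs) {
    // Released by the "test" only after the TRSP has finished
    // (stands in for the latch the UT uses to hang the SCP).
    CountDownLatch resumeScp = new CountDownLatch(1);
    // Released by the SCP once it gets far enough to interrupt the ORP.
    CountDownLatch scpInterruptedOrp = new CountDownLatch(1);
    CountDownLatch trspDone = new CountDownLatch(1);

    Thread scp = new Thread(() -> {
      try {
        resumeScp.await();             // SCP is hung intentionally
        scpInterruptedOrp.countDown(); // only then can it interrupt the ORP
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "scp");

    Thread trsp = new Thread(() -> {
      try {
        if (!failWhileDispatching) {
          // Failed after dispatching: wait for the SCP to interrupt us.
          scpInterruptedOrp.await();
        }
        // Failed while dispatching: a new ORP is scheduled, no SCP needed.
        trspDone.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "trsp");

    scp.setDaemon(true);
    trsp.setDaemon(true);
    scp.start();
    trsp.start();

    boolean finished;
    try {
      finished = trspDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      finished = false;
    }
    if (finished) {
      resumeScp.countDown(); // the test resumes the SCP after the TRSP is done
    }
    // Clean up any thread that is still parked on a latch.
    scp.interrupt();
    trsp.interrupt();
    return finished;
  }

  public static void main(String[] args) {
    System.out.println("fail while dispatching, TRSP finishes: " + trspFinishes(true, 500));
    System.out.println("fail after dispatching, TRSP finishes: " + trspFinishes(false, 500));
  }
}
```

In this model the second call times out because each side waits on a latch that only the other side can release, which is the same shape as the hang in the failed run.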
FWIW, we only resume the TRSP when we kill the region server, so I do not
think we want to test the scenario where we send an ORP to a region server and
the region server dies before returning. Let me see how to make sure that we
will fall into the deadlock scenario.
Thanks.
> TestRaceBetweenSCPAndTRSP fails in pre commit
> ---------------------------------------------
>
> Key: HBASE-27277
> URL: https://issues.apache.org/jira/browse/HBASE-27277
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Reporter: Duo Zhang
> Priority: Major
> Attachments:
> org.apache.hadoop.hbase.master.assignment.TestRaceBetweenSCPAndTRSP-output.txt
>
>
> Seems the PE worker is stuck here. Need to dig more.
> {noformat}
> "PEWorker-5" daemon prio=5 tid=326 in Object.wait()
> java.lang.Thread.State: WAITING (on object monitor)
> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
> at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
> at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
> at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
> at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
> at [email protected]/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
> at app//org.apache.hadoop.hbase.master.assignment.TestRaceBetweenSCPAndTRSP$AssignmentManagerForTest.getRegionsOnServer(TestRaceBetweenSCPAndTRSP.java:97)
> at app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.getRegionsOnCrashedServer(ServerCrashProcedure.java:288)
> at app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:195)
> at app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:66)
> at app//org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
> at app//org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:919)
> at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
> at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
> at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
> at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
> at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$477/0x0000000800ac1840.call(Unknown Source)
> at app//org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
> at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {noformat}