[ 
https://issues.apache.org/jira/browse/HBASE-27277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724679#comment-17724679
 ] 

Duo Zhang commented on HBASE-27277:
-----------------------------------

OK, the root cause is an incorrect assumption in the UT: that once the SCP has 
started, the region server has already been removed from the 
RSProcedureDispatcher. In fact, ServerManager.expireServer first submits the 
SCP and only then calls the listeners, one of which removes the region server 
from the nodeMap of RSProcedureDispatcher. Since the SCP runs in another 
thread (a PEWorker), it can reach getRegionsOnServer before the region server 
has been removed from RSProcedureDispatcher.
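
To make the ordering concrete, here is a minimal, self-contained Java sketch 
of the race. It is not HBase code: the class ExpireServerRaceSketch, the 
method scpSawServerBeforeRemoval, and the plain map standing in for the 
nodeMap are all hypothetical, and a single-thread executor plays the role of 
a PEWorker. A latch forces the problematic interleaving so that the outcome 
is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExpireServerRaceSketch {
  // Hypothetical stand-in for the nodeMap of RSProcedureDispatcher.
  static final Map<String, Object> nodeMap = new ConcurrentHashMap<>();

  /** Returns true if the simulated SCP still saw the server in nodeMap. */
  static boolean scpSawServerBeforeRemoval() throws Exception {
    String server = "rs1";
    nodeMap.put(server, new Object());
    // A single worker thread plays the role of a PEWorker.
    ExecutorService peWorker = Executors.newSingleThreadExecutor();
    CountDownLatch scpChecked = new CountDownLatch(1);
    try {
      // Step 1: the "SCP" is submitted first and runs on the worker thread.
      Future<Boolean> scp = peWorker.submit(() -> {
        boolean present = nodeMap.containsKey(server);
        scpChecked.countDown();
        return present;
      });
      // Force the bad interleaving: wait until the worker has done its check.
      scpChecked.await();
      // Step 2: only now does the "listener" remove the server.
      nodeMap.remove(server);
      return scp.get();
    } finally {
      peWorker.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(
      "SCP saw server still registered: " + scpSawServerBeforeRemoval());
  }
}
```

Without the latch, either order is possible, which is why a test that 
assumes the server has already been removed can fail intermittently.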

Anyway, this is a test-only issue. We can handle both scenarios, i.e., failing 
before dispatching the OpenRegionProcedure or after dispatching it. So let me 
think about how to fix the UT.

> TestRaceBetweenSCPAndTRSP fails in pre commit
> ---------------------------------------------
>
>                 Key: HBASE-27277
>                 URL: https://issues.apache.org/jira/browse/HBASE-27277
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Duo Zhang
>            Priority: Major
>         Attachments: 
> org.apache.hadoop.hbase.master.assignment.TestRaceBetweenSCPAndTRSP-output.txt
>
>
> It seems the PE worker is stuck here. Need to dig more.
> {noformat}
> "PEWorker-5" daemon prio=5 tid=326 in Object.wait()
> java.lang.Thread.State: WAITING (on object monitor)
>         at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
>         at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
>         at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
>         at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
>         at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
>         at [email protected]/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
>         at app//org.apache.hadoop.hbase.master.assignment.TestRaceBetweenSCPAndTRSP$AssignmentManagerForTest.getRegionsOnServer(TestRaceBetweenSCPAndTRSP.java:97)
>         at app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.getRegionsOnCrashedServer(ServerCrashProcedure.java:288)
>         at app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:195)
>         at app//org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:66)
>         at app//org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
>         at app//org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:919)
>         at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>         at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>         at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>         at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>         at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$477/0x0000000800ac1840.call(Unknown Source)
>         at app//org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>         at app//org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)