[
https://issues.apache.org/jira/browse/HBASE-28113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777571#comment-17777571
]
Hudson commented on HBASE-28113:
--------------------------------
Results for branch branch-2.4
[build #637 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/637/]:
(/) *{color:green}+1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/637/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/637/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/637/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/637/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Modify the way of acquiring the RegionStateNode lock in
> checkOnlineRegionsReport to tryLock
> -------------------------------------------------------------------------------------------
>
> Key: HBASE-28113
> URL: https://issues.apache.org/jira/browse/HBASE-28113
> Project: HBase
> Issue Type: Improvement
> Components: master
> Reporter: Haiping lv
> Assignee: Haiping lv
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
> Attachments: master.stack
>
>
> HBase Cluster description: *1 master and 5 region servers*
> During an ITBLL run, this issue is triggered when ChaosMonkey performs
> RestartRandomRsAction.
> The steps of the RestartRandomRsAction operation are as follows:
> # stop node-3, node-2, node-4.
> # then stop node-5, which holds the meta region.
> # start node-3.
> # then stop node-1.
> # start node-2, node-4, node-5, node-1.
> *Fault description:*
> 1. The RegionServer nodes node-2, node-4, node-5, and node-1 are unable to
> come online.
> The RegionServer logs show that the reportForDuty operation consistently
> times out. The log is as follows:
> {code:java}
> 2023-09-21T08:05:30,251 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:05:43,581 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:05:59,591 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:06:21,601 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:06:55,611 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:07:53,620 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:09:39,631 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:13:01,642 INFO [regionserver/core-1-2:16020]
> regionserver.HRegionServer: reportForDuty to
> master=master-1-1,16000,1695254395517 with port=16020,
> startcode=1695254725874 {code}
> 2. The master threads are blocked.
> * Both RpcServer.priority.RWQ.Fifo.write.handler threads are blocked on
> RegionStateNode.lock
> {code:java}
> "RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000" #67 daemon
> prio=5 os_prio=0 tid=0x00007f6ae3caf800 nid=0xea405 waiting on condition
> [0x00007f6aa1dcd000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000004e3c8e6f0> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> at
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
> at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
> at
> org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1401)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:1363)
> at
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:639)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:17395)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:437)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
> at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) {code}
> * 20 PEWorker threads are blocked on RegionStateStore.updateRegionLocation.
> {code:java}
> "PEWorker-1" #133 daemon prio=5 os_prio=0 tid=0x00007f6acdcf9800 nid=0xea5bc
> waiting on condition [0x00007f6a9d799000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000004e4cc8e58> (a
> java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
> at
> org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
> at
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
> at
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedAbnormally(AssignmentManager.java:2076)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:305)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:57)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown
> Source)
> at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {code}
> * All four KeepAlivePEWorker threads are blocked.
> KeepAlivePEWorker-17, 18, and 19 are blocked on
> RegionStateStore.updateRegionLocation
> {code:java}
> "KeepAlivePEWorker-17" #381 daemon prio=5 os_prio=0 tid=0x000056260b75d000
> nid=0xeffb0 waiting on condition [0x00007f6a94339000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000004ebf83440> (a
> java.util.concurrent.CompletableFuture$Signaller)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
> at
> org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
> at
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
> at
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1982)
> at
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1997)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:279)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:434)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:111)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:398)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:111)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown
> Source)
> at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {code}
> * KeepAlivePEWorker-20 is blocked on RegionStateNode.lock
> {code:java}
> "KeepAlivePEWorker-20" #388 daemon prio=5 os_prio=0 tid=0x000056260b847800
> nid=0xf02da waiting on condition [0x00007f6a92e25000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000004e3c8d990> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> at
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
> at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
> at
> org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:551)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:243)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:68)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown
> Source)
> at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {code}
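The stack traces above show the priority write handlers parked in RegionStateNode.lock, called from checkOnlineRegionsReport, while the lock holders wait on a meta update. The issue title proposes tryLock instead. The following is a minimal sketch of that idea using plain java.util.concurrent, not the actual HBase patch: RegionNode, checkReport, and demo are hypothetical names, and what the real change does with a region it could not lock is not stated in this message, so the sketch simply records it as skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class TryLockReportSketch {

    /** Hypothetical stand-in for RegionStateNode: a name plus its own lock. */
    static final class RegionNode {
        final String name;
        final ReentrantLock lock = new ReentrantLock();

        RegionNode(String name) {
            this.name = name;
        }
    }

    /**
     * Walks the reported regions. A region whose lock is held by another
     * thread is skipped instead of parking the calling handler thread.
     * Returns the names of the skipped regions.
     */
    static List<String> checkReport(List<RegionNode> reported) {
        List<String> skipped = new ArrayList<>();
        for (RegionNode node : reported) {
            if (!node.lock.tryLock()) {
                // Lock is busy: do not wait, leave this region for a later report.
                skipped.add(node.name);
                continue;
            }
            try {
                // Placeholder for the state check done under the lock.
            } finally {
                node.lock.unlock();
            }
        }
        return skipped;
    }

    /** Simulates a worker that holds one region lock while a report is processed. */
    static List<String> demo() {
        RegionNode busy = new RegionNode("region-a");
        RegionNode idle = new RegionNode("region-b");
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            busy.lock.lock();
            try {
                held.countDown();
                release.await(); // stands in for a slow meta update
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                busy.lock.unlock();
            }
        });
        worker.start();

        try {
            held.await(); // make sure the worker owns the lock first
            List<String> skipped = checkReport(Arrays.asList(busy, idle));
            release.countDown();
            worker.join();
            return skipped;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("skipped: " + demo()); // prints "skipped: [region-a]"
    }
}
```

With a blocking lock() in checkReport, the calling thread would park on region-a until the worker released it, which mirrors the parked handlers in the dump. With tryLock() the report returns immediately and the busy region is left unchecked for that round.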