[ https://issues.apache.org/jira/browse/HBASE-24545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134366#comment-17134366 ]
Michael Stack commented on HBASE-24545:
---------------------------------------
Just for illustration of the problem described, here is where a single thread
was hanging out:
{code}
"KeepAlivePEWorker-158" #909 daemon prio=5 os_prio=0 tid=0x0000000001fb5000
nid=0x29e in Object.wait() [0x00007f73fda29000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
- locked <0x00007f7c64048020> (a org.apache.zookeeper.ClientCnxn$Packet)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2587)
at
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:283)
at
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:502)
at
org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
at
org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:333)
- locked <0x00007f76381dc690> (a
org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
at
org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:262)
at
org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:350)
at
org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:335)
at
org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:272)
at
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:312)
at
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:197)
at
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
at
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
at
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1669)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1416)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:79)
at
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1986)
{code}
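
The dump shows the worker parked in SplitLogManager.waitForSplittingCompletion, which lists the entire zk splitWAL directory on every pass of a fixed-interval wait loop. As a rough sketch (not the actual HBase source; the TaskBatch field names are assumptions), the loop has this shape:

{code}
// Illustrative sketch only, based on the stack above. Every iteration
// does a full ZooKeeper getChildren() on the splitWAL dir, then sleeps
// a fixed 100ms before checking again.
private void waitForSplittingCompletion(TaskBatch batch) throws InterruptedException {
  synchronized (batch) {
    while (batch.done + batch.error != batch.installed) {
      // remainingTasksInCoordination() -> ZKUtil.listChildrenNoWatch()
      // -> ZooKeeper.getChildren(); with thousands of child znodes each
      // call pulls MBs over the wire (presumably used for status reporting).
      int remaining = getSplitLogManagerCoordination().remainingTasksInCoordination();
      batch.wait(100); // fixed 100ms poll, per worker thread
    }
  }
}
{code}

With 160 such workers each polling every 100ms, that is on the order of 1,600 full directory listings per second against zk.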
> Add backoff to SCP check on WAL split completion
> ------------------------------------------------
>
> Key: HBASE-24545
> URL: https://issues.apache.org/jira/browse/HBASE-24545
> Project: HBase
> Issue Type: Bug
> Reporter: Michael Stack
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> Crashed cluster. Lots of backed-up WALs. On startup we recover hundreds of
> servers, each with a running SCP. Taking a thread dump during recovery, I
> noticed 160 threads, each in an SCP waiting on split-WAL completion. Each
> thread was scanning the zk splitWAL directory every 100ms. The dir had
> thousands of entries in it, so each check was pulling down MBs from zk...
> times 160 (16 max configured PE threads * 10 for the KeepAlive factor that
> lets the PE worker pool grow to 10 * the configured PEs).
> If lots of WALs remain to split, have the SCP back off on its wait so it
> checks less frequently.
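
As a hedged sketch of the proposed backoff (illustrative only, not the committed patch; WAIT_START_MS, WAIT_MAX_MS, and the remaining-task threshold are hypothetical), the wait loop could grow its poll interval while many tasks remain:

{code}
// Sketch of the backoff idea. Start at the old fixed 100ms interval and
// double it up to a cap while lots of WALs remain, so hundreds of
// concurrent SCPs stop listing the zk splitWAL dir ten times a second.
final long WAIT_START_MS = 100;   // hypothetical: the old fixed interval
final long WAIT_MAX_MS = 10_000;  // hypothetical cap on the backoff
long waitMs = WAIT_START_MS;
synchronized (batch) {
  while (batch.done + batch.error != batch.installed) {
    int remaining = getSplitLogManagerCoordination().remainingTasksInCoordination();
    // Back off while many tasks remain; reset once the queue is nearly
    // drained so completion is still noticed promptly.
    waitMs = remaining > 100 ? Math.min(waitMs * 2, WAIT_MAX_MS) : WAIT_START_MS;
    batch.wait(waitMs);
  }
}
{code}

This keeps the loop cheap while the splitWAL dir is large, amortizing the expensive getChildren() scans over a longer window, yet still polls quickly as splitting finishes.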