[ https://issues.apache.org/jira/browse/HBASE-24545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134366#comment-17134366 ]

Michael Stack commented on HBASE-24545:
---------------------------------------

Just for illustration of the problem described, here is where a single thread 
was hanging out:
{code}
"KeepAlivePEWorker-158" #909 daemon prio=5 os_prio=0 tid=0x0000000001fb5000 nid=0x29e in Object.wait() [0x00007f73fda29000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1529)
        - locked <0x00007f7c64048020> (a org.apache.zookeeper.ClientCnxn$Packet)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1512)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2587)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:283)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:502)
        at org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
        at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:333)
        - locked <0x00007f76381dc690> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:262)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:350)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:335)
        at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:272)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.splitLogs(ServerCrashProcedure.java:312)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:197)
        at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
        at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
        at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1669)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1416)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:79)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1986)
{code}
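
For context, the wait the trace points at is a fixed-interval poll on the task batch monitor. Here is a rough sketch of that shape; it is not the actual SplitLogManager source. Only the 100ms interval and the remainingTasksInCoordination() call come from this issue, the rest (class and field names, the stand-in method body) is illustrative:

{code}
// Rough approximation of the wait the trace above shows (not the real HBase
// code): each ServerCrashProcedure worker parks on the task batch monitor and
// re-lists the zk splitWAL directory every 100 ms until no split tasks remain.
final class FixedIntervalSplitWait {
  private final Object batch = new Object();          // stands in for TaskBatch

  void waitForSplittingCompletion() throws InterruptedException {
    synchronized (batch) {
      while (remainingTasksInCoordination() > 0) {    // zk getChildren() on each pass
        batch.wait(100);                              // fixed 100 ms between checks
      }
    }
  }

  // Hypothetical stand-in for ZKSplitLogManagerCoordination.remainingTasksInCoordination();
  // in HBase this lists the children of the splitWAL znode and counts them.
  private int remainingTasksInCoordination() {
    return 0;
  }
}
{code}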

> Add backoff to SCP check on WAL split completion
> ------------------------------------------------
>
>                 Key: HBASE-24545
>                 URL: https://issues.apache.org/jira/browse/HBASE-24545
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> Crashed cluster. Lots of backed-up WALs. Startup. Recover hundreds of 
> servers; each has a running SCP. Taking a thread dump during recovery, I 
> noticed 160 threads each sitting in an SCP waiting on split WAL completion. 
> Each thread was scanning the zk splitWAL directory every 100ms. The dir had 
> thousands of entries in it, so each check was pulling down MBs from zk... 
> times 160 threads (the max configured PE threads (16) * 10 for the KeepAlive 
> factor, which sizes the PE worker pool at 10 * the configured PEs).
> If there are lots of remaining WALs to split, have the SCP back off on its 
> wait so it checks less frequently.
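
A minimal sketch of the backoff idea the description proposes, assuming a capped exponential wait in place of the fixed 100ms poll. The class, helper names, and the 10s cap below are illustrative, not the committed patch:

{code}
// Illustrative only: back off the SCP's split-WAL completion check instead of
// polling zk every 100 ms. The wait starts at 100 ms and doubles up to a cap
// while tasks remain, so hundreds of concurrent SCP workers stop hammering zk.
final class SplitWaitBackoff {
  private static final long INITIAL_WAIT_MS = 100;   // current fixed interval
  private static final long MAX_WAIT_MS = 10_000;    // hypothetical cap

  private long currentWaitMs = INITIAL_WAIT_MS;

  /** Wait on the batch monitor, doubling the timeout after each unproductive check. */
  void waitForNextCheck(Object batchMonitor) throws InterruptedException {
    synchronized (batchMonitor) {
      batchMonitor.wait(currentWaitMs);
    }
    currentWaitMs = Math.min(currentWaitMs * 2, MAX_WAIT_MS);
  }

  /** Call when the remaining-task count drops, so an active recovery keeps checking promptly. */
  void reset() {
    currentWaitMs = INITIAL_WAIT_MS;
  }
}
{code}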



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
