[ https://issues.apache.org/jira/browse/HBASE-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912869#comment-16912869 ]
Zheng Hu commented on HBASE-22867:
----------------------------------

Attached two jstack files: 191318.stack and 191318.stack.1. I captured 191318.stack first and 191318.stack.1 a few seconds later.

In the first file, we can clearly see that 6 threads in dir-scan-pool were blocked, waiting to enter SnapshotHFileCleaner#getDeletableFiles.
{code}
"dir-scan-pool4-thread-8" #18765 daemon prio=5 os_prio=0 tid=0x00007f4a20009c60 nid=0x6576 waiting for monitor entry [0x00007f48a6191000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:68)
	- waiting to lock <0x000000034411dc88> (a org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:295)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore.lambda$traverseAndDelete$1(CleanerChore.java:405)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore$$Lambda$187/1141106127.act(Unknown Source)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:460)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:405)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore.lambda$null$2(CleanerChore.java:414)
	at org.apache.hadoop.hbase.master.cleaner.CleanerChore$$Lambda$185/2070209024.run(Unknown Source)
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- <0x000000038a476bf8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
In the second file, the threads have finished all their work and are waiting for new tasks.
That means the cleaner is no longer blocked, which looks good.

> The ForkJoinPool in CleanerChore will spawn thousands of threads in our cluster with thousands table
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22867
>                 URL: https://issues.apache.org/jira/browse/HBASE-22867
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Critical
>         Attachments: 191318.stack, 191318.stack.1, 31162.stack.1
>
>
> The thousands of spawned threads make the safepoint cost 80+s in our Master JVM process.
> {code}
> 2019-08-15,19:35:35,861 INFO [main-SendThread(zjy-hadoop-prc-zk02.bj:11000)] org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 82260ms for sessionid 0x1691332e2d3aae5, closing socket connection and attempting reconnect
> {code}
> The stdout from the JVM (it shows 9126 threads and a sync cost of 80+s):
> {code}
>          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 32358.859: ForceAsyncSafepoint   [    9126          67            474    ]      [     1    28 86596    87   101    ]  0
> {code}
> Also we got the jstack:
> {code}
> $ cat 31162.stack.1 | grep 'ForkJoinPool-1-worker' | wc -l
> 8648
> {code}
> It's a dangerous bug, so I'm making it a blocker.
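
The `grep ... | wc -l` count in the issue description gives only a total per pool. As a rough illustration, the same kind of dump could also be tallied by thread state, which would show at a glance how many pool threads are BLOCKED versus WAITING. This is a minimal sketch that assumes HotSpot-style jstack text; `JstackSummary` and `countByState` are hypothetical names used for illustration and are not part of HBase.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: tallies jstack threads by state.
public class JstackSummary {
    // A thread header line starts with the quoted thread name.
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\".*");
    // The state line follows the header, e.g. "java.lang.Thread.State: BLOCKED (on object monitor)".
    private static final Pattern STATE =
        Pattern.compile("^\\s*java\\.lang\\.Thread\\.State: (\\w+).*");

    /** Counts threads whose name starts with namePrefix, grouped by Thread.State. */
    public static Map<String, Integer> countByState(String dump, String namePrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean inMatchingThread = false;
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.matches()) {
                // Each new header decides whether the following state line is counted.
                inMatchingThread = header.group(1).startsWith(namePrefix);
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (inMatchingThread && state.matches()) {
                counts.merge(state.group(1), 1, Integer::sum);
                inMatchingThread = false; // count each thread at most once
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: JstackSummary <jstack-file> <thread-name-prefix>");
            return;
        }
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(countByState(dump, args[1]));
    }
}
```

For example, it could be run as `java JstackSummary 191318.stack dir-scan-pool` or `java JstackSummary 31162.stack.1 ForkJoinPool-1-worker`; the resulting counts depend on the dump.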