[ 
https://issues.apache.org/jira/browse/HBASE-22867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912869#comment-16912869
 ] 

Zheng Hu commented on HBASE-22867:
----------------------------------

Attached two jstack files: 191318.stack and 191318.stack.1.
I captured 191318.stack first and, a few seconds later, 191318.stack.1.
In the first file, we can clearly see 6 threads in dir-scan-pool that are
blocked, waiting to acquire the monitor for
SnapshotHFileCleaner#getDeletableFiles.
{code}
"dir-scan-pool4-thread-8" #18765 daemon prio=5 os_prio=0 tid=0x00007f4a20009c60 nid=0x6576 waiting for monitor entry [0x00007f48a6191000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:68)
        - waiting to lock <0x000000034411dc88> (a org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:295)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore.lambda$traverseAndDelete$1(CleanerChore.java:405)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore$$Lambda$187/1141106127.act(Unknown Source)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:460)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:405)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore.lambda$null$2(CleanerChore.java:414)
        at org.apache.hadoop.hbase.master.cleaner.CleanerChore$$Lambda$185/2070209024.run(Unknown Source)
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - <0x000000038a476bf8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}
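For anyone who wants to repeat the count on their own dump, here is a minimal
sketch that counts BLOCKED threads by thread-name prefix. The class and method
names are hypothetical, and it assumes the usual jstack layout (a header line
starting with the quoted thread name, followed by a "java.lang.Thread.State:"
line). The second sample thread is synthetic, added only for contrast.

```java
import java.util.Arrays;
import java.util.List;

public class JstackBlockedCounter {
    /**
     * Counts threads whose name starts with namePrefix and whose state line
     * reports BLOCKED. Assumes one header line per thread, beginning with the
     * quoted thread name, followed later by a "java.lang.Thread.State:" line.
     */
    public static int countBlocked(List<String> lines, String namePrefix) {
        int blocked = 0;
        boolean inMatchingThread = false;
        for (String line : lines) {
            if (line.startsWith("\"")) {
                // New thread header: remember whether its name matches the prefix.
                inMatchingThread = line.startsWith("\"" + namePrefix);
            } else if (inMatchingThread
                    && line.contains("java.lang.Thread.State: BLOCKED")) {
                blocked++;
                inMatchingThread = false; // count each thread at most once
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            // Header and state line taken from the dump above.
            "\"dir-scan-pool4-thread-8\" #18765 daemon prio=5 os_prio=0",
            "   java.lang.Thread.State: BLOCKED (on object monitor)",
            // Synthetic thread (not from the real dump), added for contrast.
            "\"dir-scan-pool4-thread-9\" #18766 daemon prio=5 os_prio=0",
            "   java.lang.Thread.State: WAITING (parking)");
        System.out.println(countBlocked(sample, "dir-scan-pool"));
    }
}
```

On a real dump, the lines could be read with Files.readAllLines and passed to
countBlocked together with the pool prefix of interest.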
In the second file, the threads have finished all their work and are waiting
for new tasks. That means the cleaner is no longer blocked, so it seems fine.
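
The "waiting to lock" frames in the first dump are consistent with several pool
threads queuing on one object monitor. Below is a minimal, self-contained
sketch of that pattern, not the HBase code itself: SlowCleaner and its
synchronized method are hypothetical stand-ins, and the demo deliberately holds
the monitor so that every worker is observed in the BLOCKED state.

```java
import java.util.ArrayList;
import java.util.List;

public class MonitorContentionDemo {
    /** Hypothetical stand-in for a cleaner delegate with a synchronized method. */
    static class SlowCleaner {
        synchronized int getDeletableFiles(int candidates) {
            return candidates; // real work would run here while holding the monitor
        }
    }

    /**
     * Starts the given number of worker threads while the caller holds the
     * cleaner's monitor, and returns how many workers were observed BLOCKED.
     */
    public static int countBlockedWorkers(int workers) throws InterruptedException {
        SlowCleaner cleaner = new SlowCleaner();
        List<Thread> threads = new ArrayList<>();
        int blocked = 0;
        synchronized (cleaner) { // simulates one long-running call owning the lock
            for (int i = 0; i < workers; i++) {
                Thread t = new Thread(() -> cleaner.getDeletableFiles(1),
                                      "demo-scan-thread-" + i);
                t.setDaemon(true);
                t.start();
                threads.add(t);
            }
            // Poll until each worker is waiting for monitor entry.
            for (Thread t : threads) {
                while (t.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(1);
                }
                blocked++;
            }
        } // monitor released: workers now enter one at a time and finish
        for (Thread t : threads) {
            t.join();
        }
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocked workers: " + countBlockedWorkers(6));
    }
}
```

Once the monitor is released the workers drain one by one, which matches what
the second dump shows: no thread is stuck, they were only serialized.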

> The ForkJoinPool in CleanerChore will spawn thousands of threads in our 
> cluster with thousands table
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22867
>                 URL: https://issues.apache.org/jira/browse/HBASE-22867
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Critical
>         Attachments: 191318.stack, 191318.stack.1, 31162.stack.1
>
>
> The thousands of spawned threads make reaching a safepoint cost 80+s in our
> Master JVM process.
> {code}
> 2019-08-15,19:35:35,861 INFO [main-SendThread(zjy-hadoop-prc-zk02.bj:11000)] org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 82260ms for sessionid 0x1691332e2d3aae5, closing socket connection and attempting reconnect
> {code}
> The safepoint statistics on the JVM's stdout show 9126 threads and a sync
> time of 80+s (86596 ms):
> {code}
> vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
> 32358.859: ForceAsyncSafepoint              [    9126         67            474    ]      [     1    28 86596    87   101    ]  0
> {code}
> We also captured a jstack:
> {code}
> $ cat 31162.stack.1  | grep 'ForkJoinPool-1-worker' | wc -l
> 8648
> {code}
> It's a dangerous bug, so I'm marking it as a blocker.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
