[ 
https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999396#comment-16999396
 ] 

Íñigo Goiri commented on HDFS-15069:
------------------------------------

Is this related to HDFS-12703?

> DecommissionMonitor-0 thread will block forever while its timer task 
> scheduled encountered any unchecked exceptions.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15069
>                 URL: https://issues.apache.org/jira/browse/HDFS-15069
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Major
>         Attachments: stack_on_16_12.png, stack_on_16_42.png
>
>
> More than once, we have observed that while decommissioning a large 
> number of DNs, the DecommissionMonitor-0 thread stops being scheduled and 
> blocks for a long time, without any exception logs or notifications.
> For example, we recently decommissioned 65 DNs at the same time, each 
> holding about 10TB, and the DecommissionMonitor-0 thread blocked for about 
> 15 days.
> The stack of DecommissionMonitor-0 looks like this:
>  # stack on 2019.12.17 16:12  !stack_on_16_12.png!
>  # stack on 2019.12.17 16:42  !stack_on_16_42.png!
> Over that half hour the thread was not scheduled at all: its Waited count 
> did not change.
> We think the cause of the problem is:
>  # The DecommissionMonitor task submitted by the NameNode encounters an 
> unchecked exception while running, after which the task is never executed 
> again.
>  # The NameNode ignores the ScheduledFuture of this task and never calls 
> ScheduledFuture.get(), so the unchecked exception thrown by the task stays 
> stored in the future and nobody notices it.
> After that, the observed symptoms are:
>  # The ScheduledExecutorService thread DecommissionMonitor-0 blocks 
> forever in ThreadPoolExecutor.getTask().
>  # The previously submitted DecommissionMonitor task is never executed 
> again.
>  # No logs or notifications tell us what actually happened.
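
The cause and symptoms listed above can be reproduced outside HDFS with a plain ScheduledExecutorService. The following is only a minimal sketch, not NameNode code: the class name SilentMonitorFailure, the method runsBeforeSuppression(), and the 10 ms delay are made up for illustration. It schedules a periodic task that throws an unchecked exception on its second run, then shows that the exception is only observable through ScheduledFuture.get() and that the task is not run again.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it stands in for the decommission monitor setup.
class SilentMonitorFailure {

    /** Schedules a task that throws on its 2nd run and returns the total run count. */
    static int runsBeforeSuppression() {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();

        ScheduledFuture<?> future = executor.scheduleWithFixedDelay(() -> {
            if (runs.incrementAndGet() == 2) {
                // Simulated unchecked exception inside the periodic task.
                throw new IllegalStateException("simulated failure");
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        try {
            try {
                // For a periodic task, get() only returns by throwing:
                // here, once the task has failed.
                future.get();
            } catch (ExecutionException e) {
                // Without this call nothing would report the failure.
                System.out.println("Only visible via get(): " + e.getCause());
            }
            // Give the executor time for several more periods; none will run.
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("runs = " + runsBeforeSuppression());
    }
}
```

In this sketch the worker thread stays alive after the failure but has nothing left to execute, which matches a thread parked in ThreadPoolExecutor.getTask().
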
> Possible solutions:
>  # Do not use a thread pool to execute the decommission monitor task; 
> instead, introduce a dedicated thread for it, as is done for 
> HeartbeatManager, ReplicationMonitor, LeaseManager, BlockReportThread, and 
> so on.
>  # Catch all exceptions in the decommission monitor task's run() method, 
> so that it never throws.
> I prefer the second option.
>  
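
The second option in the quoted description (catching everything inside run()) might look roughly like the sketch below. This is an assumption-laden illustration rather than the actual HDFS patch: GuardedMonitorTask, guarded(), and runsWithGuard() are hypothetical names, and System.err stands in for whatever logger the NameNode would use. The wrapper catches Exception so that the periodic task is rescheduled after a failure; whether to also catch Error is a separate design decision that is left open here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; not part of Hadoop.
class GuardedMonitorTask {

    /** Wraps a task so that no Exception escapes run() and cancels the schedule. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Exception e) {
                // Placeholder logging; the failure is reported, not swallowed silently.
                System.err.println("Monitor task failed, will run again: " + e);
            }
        };
    }

    /** Runs an always-failing task under the guard until it has run wantedRuns times. */
    static int runsWithGuard(int wantedRuns) {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(wantedRuns);

        executor.scheduleWithFixedDelay(guarded(() -> {
            runs.incrementAndGet();
            done.countDown();
            // Every run fails, yet the guard keeps the schedule alive.
            throw new IllegalStateException("simulated failure");
        }), 0, 10, TimeUnit.MILLISECONDS);

        try {
            // Wait (with a generous timeout) until the task has run often enough.
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("runs with guard = " + runsWithGuard(5));
    }
}
```

With the guard in place, each failure is logged and the next period still runs, so the monitor cannot stop silently the way it does in the reported scenario.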



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
