[
https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xudong Cao resolved HDFS-15069.
-------------------------------
Resolution: Duplicate
> DecommissionMonitor-0 thread will block forever while its timer task
> scheduled encountered any unchecked exceptions.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-15069
> URL: https://issues.apache.org/jira/browse/HDFS-15069
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.1.3
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Major
> Attachments: stack_on_16_12.png, stack_on_16_42.png
>
>
> We have observed more than once that, while a large number of DNs are being
> decommissioned, the DecommissionMonitor-0 thread stops being scheduled and
> stays blocked for a long time, with no exception logs or notifications at
> all.
> For example, we recently decommissioned 65 DNs at the same time, each holding
> about 10 TB, and the DecommissionMonitor-0 thread was blocked for about 15 days.
> The stack of DecommissionMonitor-0 looks like this:
> # stack on 2019.12.17 16:12 !stack_on_16_12.png!
> # stack on 2019.12.17 16:42 !stack_on_16_42.png!
> Over this half hour the thread was not scheduled at all: its Waited count
> did not change.
> We think the cause of the problem is:
> # The DecommissionMonitor task submitted by the NameNode throws an unchecked
> exception while running, after which the executor never runs the task
> again.
> # The NameNode ignores the ScheduledFuture of this task and never calls
> ScheduledFuture.get(), so the unchecked exception stays captured in the
> future and is never reported to anyone.
> The resulting symptoms are:
> # The ScheduledExecutorService thread DecommissionMonitor-0 blocks
> forever in ThreadPoolExecutor.getTask().
> # The previously submitted DecommissionMonitor task is never executed
> again.
> # No logs or notifications tell us what actually happened.
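
The behaviour described above can be reproduced outside HDFS with a few lines of plain java.util.concurrent code. The sketch below is only an illustration under stated assumptions: the class name SuppressedTaskDemo, the 20 ms period, and the simulated IllegalStateException are made up for the demo and are not taken from the NameNode code. It schedules a periodic task that fails on its second run, then shows that the task is not run again and that the exception is visible only through ScheduledFuture.get().

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressedTaskDemo {
    /** Number of times the periodic task has started (demo-only counter). */
    static final AtomicInteger runs = new AtomicInteger();

    /**
     * Schedules a periodic task that throws on its second run.
     * Returns the exception captured by the future, or null if none was captured.
     */
    static Throwable runDemo() {
        runs.set(0);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> future = pool.scheduleAtFixedRate(() -> {
                if (runs.incrementAndGet() == 2) {
                    // Stand-in for any unchecked exception thrown by the monitor.
                    throw new IllegalStateException("simulated monitor failure");
                }
            }, 0, 20, TimeUnit.MILLISECONDS);

            // Long enough for many more periods if the task were still scheduled.
            Thread.sleep(300);

            if (!future.isDone()) {
                return null; // task is still being scheduled normally
            }
            future.get(); // rethrows the captured failure as ExecutionException
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Throwable cause = runDemo();
        System.out.println("runs=" + runs.get() + ", captured=" + cause);
    }
}
```

Nothing is logged by the executor itself when the task fails; the only trace of the failure is the exception stored in the future, which matches the silent hang reported here.
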
> Possible solutions (either one):
> # Do not use a thread pool to run the decommission monitor task; instead,
> introduce a dedicated thread for it, as is done for HeartbeatManager,
> ReplicationMonitor, LeaseManager, BlockReportThread, and so on.
> # Catch all exceptions in the decommission monitor task's run() method,
> so that it never throws.
> I prefer the second option.
>
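
A minimal sketch of the second option, assuming nothing about the real DecommissionMonitor code: the helper guarded(), the counters, and the class name GuardedTaskDemo are hypothetical names introduced for illustration. The wrapper catches and logs every Exception inside run(), so the executor never sees a failure and keeps scheduling the task.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedTaskDemo {
    /** Demo-only counters: task starts and caught failures. */
    static final AtomicInteger runs = new AtomicInteger();
    static final AtomicInteger failures = new AtomicInteger();

    /**
     * Hypothetical helper: wraps a task so that no Exception escapes run().
     * Errors (e.g. OutOfMemoryError) are deliberately not caught here.
     */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Exception e) {
                failures.incrementAndGet();
                System.err.println("monitor iteration failed, retrying next period: " + e);
            }
        };
    }

    /** Runs a task that fails once on its second run; returns the total number of runs. */
    static int runDemo() {
        runs.set(0);
        failures.set(0);
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        try {
            pool.scheduleAtFixedRate(guarded(() -> {
                if (runs.incrementAndGet() == 2) {
                    throw new IllegalStateException("simulated monitor failure");
                }
            }), 0, 20, TimeUnit.MILLISECONDS);

            Thread.sleep(300); // let several periods elapse after the failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        int total = runDemo();
        System.out.println("runs=" + total + ", failures=" + failures.get());
    }
}
```

In this sketch the failure is logged once and the task continues to run on later periods. Whether to catch Exception or Throwable is a design choice left open here; the sketch catches Exception so that fatal Errors still surface.
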
--
This message was sent by Atlassian Jira
(v8.3.4#803005)