[ 
https://issues.apache.org/jira/browse/HDFS-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xudong Cao updated HDFS-15069:
------------------------------
    Description: 
More than once, we have observed that while decommissioning a large number of 
DNs, the DecommissionMonitor-0 thread stops being scheduled and blocks for a 
long time, with no exception logs or notifications at all.

For example, we recently decommissioned 65 DNs at the same time, each holding 
about 10 TB, and the DecommissionMonitor-0 thread blocked for about 15 days.

The stack of DecommissionMonitor-0 looks like this:
 # stack on 2019.12.17 16:12  !stack_on_16_12.png!
 # stack on 2019.12.17 16:42  !stack_on_16_42.png!

It can be seen that over this half hour the thread was not scheduled at all: 
its Waited count did not change.

We think the cause of the problem is:
 # The DecommissionMonitor task submitted by the NameNode throws an unchecked 
exception while running, after which the executor never runs the task again.
 # The NameNode ignores the ScheduledFuture of this task and never calls 
ScheduledFuture.get(), so the unchecked exception stays stored inside the 
future and is never surfaced to anyone.
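
This behavior is documented for ScheduledExecutorService: if any execution of a 
periodic task throws, subsequent executions are suppressed, and the exception 
only becomes visible through the future. A minimal standalone sketch (class and 
message names are ours, not from the HDFS code) reproducing it:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressedTaskDemo {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();

        // Periodic task, like the DecommissionMonitor: fails on its 2nd run.
        ScheduledFuture<?> future = pool.scheduleAtFixedRate(() -> {
            if (runs.incrementAndGet() == 2) {
                // Any unchecked exception cancels all further executions.
                throw new IllegalStateException("simulated monitor failure");
            }
        }, 0, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(500);  // ample time for ~10 runs, if they still happened
        System.out.println("runs = " + runs.get());  // stays at 2

        try {
            future.get();  // only here does the hidden exception surface
        } catch (ExecutionException e) {
            System.out.println("hidden cause: " + e.getCause());
        }
        pool.shutdownNow();
    }
}
```

Because the NameNode never calls get() on the future, the catch branch above 
never runs in production, and the failure is completely silent.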

After that, the observable symptoms are:
 # The ScheduledExecutorService thread DecommissionMonitor-0 blocks forever in 
ThreadPoolExecutor.getTask().
 # The previously submitted DecommissionMonitor task is never executed again.
 # No logs or notifications tell us what has happened.
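
A common defensive pattern for this class of bug (a sketch of one possible fix, 
not the patch itself; the wrapper class name is ours) is to wrap the periodic 
Runnable so that unchecked exceptions are logged and swallowed, which keeps the 
schedule alive:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Wraps a periodic task so unchecked exceptions are logged instead of
 *  silently cancelling all future executions. */
public class LoggingRunnable implements Runnable {
    private final Runnable delegate;

    public LoggingRunnable(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Throwable t) {
            // Returning normally keeps the fixed-rate schedule alive.
            System.err.println("Periodic task failed, will retry: " + t);
        }
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();

        // Even though every run throws, the task keeps getting rescheduled.
        pool.scheduleAtFixedRate(new LoggingRunnable(() -> {
            runs.incrementAndGet();
            throw new IllegalStateException("boom");
        }), 0, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(300);
        pool.shutdownNow();
        System.out.println("kept running: " + (runs.get() >= 2));
    }
}
```

With this wrapper, a single bad pass of the DecommissionMonitor would be logged 
and retried on the next tick instead of killing decommissioning for weeks.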



> DecommissionMonitor thread will block forever after it encounters an 
> unchecked exception.
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15069
>                 URL: https://issues.apache.org/jira/browse/HDFS-15069
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.1.3
>            Reporter: Xudong Cao
>            Assignee: Xudong Cao
>            Priority: Major
>         Attachments: stack_on_16_12.png, stack_on_16_42.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
