[ 
https://issues.apache.org/jira/browse/HDFS-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936917#comment-14936917
 ] 

Kihwal Lee commented on HDFS-8676:
----------------------------------

The patch looks good in general. There is a potential problem, which was part 
of the original code.  There is a precondition check that can throw 
{{IllegalStateException}}, which is a {{RuntimeException}}.  This can cause 
{{offerService()}} to blow up in the middle of heartbeat response processing, 
so an important command such as {{DNA_ACCESSKEYUPDATE}} can be dropped.  
Instead of blowing up in the middle, it should log at {{ERROR}} level and 
move on.  I suggest changing it to a combination of {{assert}} and a 
conditional statement. 
{{assert}} will make sure it blows up in testing, so we will know if something 
is obviously broken. In production, the conditional statement will log the 
error message and simply skip the deletion.
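
A minimal sketch of the suggested pattern, not the actual patch. The class 
name, the method name, and the {{preconditionHolds}} parameter are hypothetical 
stand-ins for the real precondition and deletion code, and 
{{java.util.logging}} is used only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Illustration only: every name here is an assumption made for the example
 * and does not come from the HDFS code base.
 */
public class TrashCleanupSketch {
    private static final Logger LOG =
        Logger.getLogger(TrashCleanupSketch.class.getName());

    /**
     * Runs the deletion only when the expected state holds.
     *
     * @param preconditionHolds hypothetical stand-in for the existing state check
     * @param deleteTrash       hypothetical stand-in for the trash deletion
     * @return true if the deletion ran, false if it was skipped
     */
    public static boolean clearTrashIfSafe(boolean preconditionHolds,
                                           Runnable deleteTrash) {
        // With assertions enabled (java -ea, as is typical in test runs),
        // a violated precondition fails fast here.
        assert preconditionHolds : "unexpected state before trash deletion";

        // With assertions disabled (the JVM default), log the problem and
        // skip the deletion instead of throwing a RuntimeException that
        // would abort the caller's remaining work.
        if (!preconditionHolds) {
            LOG.log(Level.SEVERE,
                "Unexpected state before trash deletion; skipping deletion.");
            return false;
        }

        deleteTrash.run();
        return true;
    }
}
```

Because the method returns instead of throwing when assertions are disabled, 
the caller can continue with the remaining commands in the same heartbeat 
response.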

> Delayed rolling upgrade finalization can cause heartbeat expiration
> -------------------------------------------------------------------
>
>                 Key: HDFS-8676
>                 URL: https://issues.apache.org/jira/browse/HDFS-8676
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Walter Su
>            Priority: Critical
>         Attachments: HDFS-8676.01.patch
>
>
> In big busy clusters where the deletion rate is also high, a lot of blocks 
> can pile up in the datanode trash directories until an upgrade is finalized.  
> When the upgrade is eventually finalized, the trash is deleted synchronously 
> in the service actor thread's context.  This blocks the heartbeat and can 
> cause heartbeat expiration.  
> We have seen a namenode losing hundreds of nodes after a delayed upgrade 
> finalization.  The deletion of trash directories should be made asynchronous.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
