[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202348#comment-14202348
 ] 

Zhe Zhang commented on HDFS-7374:
---------------------------------

[~mingma] Thanks much for clarifying the state machine. I agree my option #2 is 
cleaner and makes the decommissioning of dead nodes much faster. I'll go ahead 
with that approach now. 

bq. If the node stays in Dead, DECOMMISSION_INPROGRESS for too long, have the 
higher layer application remove the node from exclude file and thus abort the 
decommission process. This will transition the node to Dead, NORMAL.

The specific higher layer application in my case is Cloudera Manager, and I 
think it's possible to add this logic there. However, I don't know how easy it 
is to change all similar management applications.
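
As a rough illustration of the check such a management application would need, 
here is a minimal sketch. All names in it ({{DecommissionWatchdog}}, 
{{shouldAbort}}, {{afterRemovalFromExcludeFile}}, the timeout) are hypothetical; 
this is not code from HDFS or Cloudera Manager. It models only the single 
transition described in the quote above.

```java
import java.time.Duration;

// Hypothetical sketch: none of these names come from HDFS or Cloudera Manager.
final class DecommissionWatchdog {
    /** Admin states named in this thread. */
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    // Assumed, application-chosen limit on how long a dead node may stay
    // in DECOMMISSION_INPROGRESS before the decommission is aborted.
    private final Duration maxWait;

    DecommissionWatchdog(Duration maxWait) {
        this.maxWait = maxWait;
    }

    /**
     * Returns true when the node is dead, is still DECOMMISSION_INPROGRESS,
     * and has been in that state for longer than maxWait.
     */
    boolean shouldAbort(boolean alive, AdminState state, Duration timeInState) {
        return !alive
            && state == AdminState.DECOMMISSION_INPROGRESS
            && timeInState.compareTo(maxWait) > 0;
    }

    /**
     * Models only the transition from the quoted proposal: removing the node
     * from the exclude file moves DECOMMISSION_INPROGRESS back to NORMAL.
     * Other states are returned unchanged because this sketch does not model them.
     */
    static AdminState afterRemovalFromExcludeFile(AdminState state) {
        return state == AdminState.DECOMMISSION_INPROGRESS ? AdminState.NORMAL : state;
    }
}
```

The point is that every management application would have to carry its own 
timeout and its own exclude file handling, which is the part I am unsure is 
easy to change everywhere.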

bq.  HDFS-6791 mentioned another way to address the original issue. When nodes 
become dead, mark them DECOMMISSIONED and fix the replication to handle this 
case. In other words, get rid of Dead, DECOMMISSION_INPROGRESS state.

Do you mean allowing a {{DECOMMISSIONED}} node to be the source of a replica 
transfer? It seems a little fragile to me; intuitively, it could surprise upper 
layer applications that a {{DECOMMISSIONED}} node is still actively 
transferring data. But I would like to hear opinions from other people.

> Allow decommissioning of dead DataNodes
> ---------------------------------------
>
>                 Key: HDFS-7374
>                 URL: https://issues.apache.org/jira/browse/HDFS-7374
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>
> We have seen the use case of decommissioning DataNodes that are already dead 
> or unresponsive, and not expected to rejoin the cluster.
> The logic introduced by HDFS-6791 will mark those nodes as 
> {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
> the decommission work. If an upper layer application is monitoring the 
> decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
