[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211582#comment-14211582
 ] 

Andrew Wang commented on HDFS-7374:
-----------------------------------

This issue is definitely tricky. I agree with the discussion so far, and 
thanks especially to [~mingma] for weighing in with insights from HDFS-6791. I 
think Zhe's proposal #2 is good, and we should work on getting it in. As a 
follow-on, we could also consider exposing more information to operators to 
help them decide whether to "force decom" by editing the exclude file. The 
core issue IIUC is knowing whether a force decom will result in data loss, 
which could probably be pieced together from fsck, but is by no means cheap 
to do.

With that, some light patch comments:

* I think the logic is a bit wrong right now: it can shortcut a node from 
(DEAD, DECOM_IN_PROGRESS) to (DEAD, DECOMMED) if refresh is called while the 
node is in the exclude file. IIUC, the only transition we want to allow is 
(DEAD, NORMAL) to (DEAD, DECOMMED).
* Because of the above, it would be good to move this logic into 
startDecommission instead. We also want to log something even in this 
situation.
* We can use GenericTestUtils#waitFor to implement waitForDatanodeState. As a 
bonus, it prints a nice stack trace.
* Could clean up the imports in TestDeadDatanode, TestDecommissioningStatus
* In TestDecomm, we could remove the println and fix the line that is longer 
than 80 chars. We could also add a test timeout.
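
To make the first two points concrete, here is a minimal sketch of the 
transition rule I have in mind. The enums and the DecomSketch class are 
hypothetical stand-ins, not the actual HDFS classes or the patch's code, and 
the live-node branch is my assumption about the usual behavior. It only shows 
the check living inside a startDecommission-style method, with a log line on 
every path.

```java
// Hypothetical stand-ins for illustration only; these are NOT the HDFS classes.
enum Liveness { LIVE, DEAD }

enum AdminState { NORMAL, DECOM_IN_PROGRESS, DECOMMED }

final class DecomSketch {
    private DecomSketch() {}

    /**
     * Returns the admin state a node should move to when it is found in the
     * exclude file during a refresh.
     */
    static AdminState startDecommission(Liveness liveness, AdminState current) {
        if (current != AdminState.NORMAL) {
            // Already decommissioning or decommissioned: a refresh must not
            // shortcut (DEAD, DECOM_IN_PROGRESS) to (DEAD, DECOMMED).
            System.out.println("No change for node in state ("
                + liveness + ", " + current + ")");
            return current;
        }
        if (liveness == Liveness.DEAD) {
            // The only shortcut allowed: (DEAD, NORMAL) -> (DEAD, DECOMMED).
            System.out.println("Dead node: marking DECOMMED directly");
            return AdminState.DECOMMED;
        }
        // Assumed usual path for a live node.
        System.out.println("Live node: starting decommission");
        return AdminState.DECOM_IN_PROGRESS;
    }
}
```

With this shape, calling refresh repeatedly on a (DEAD, DECOM_IN_PROGRESS) 
node leaves it where it is, and every branch logs what it did.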

Thanks for working on this Zhe!

> Allow decommissioning of dead DataNodes
> ---------------------------------------
>
>                 Key: HDFS-7374
>                 URL: https://issues.apache.org/jira/browse/HDFS-7374
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-7374-001.patch
>
>
> We have seen the use case of decommissioning DataNodes that are already dead 
> or unresponsive, and not expected to rejoin the cluster.
> The logic introduced by HDFS-6791 will mark those nodes as 
> {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
> the decommission work. If an upper layer application is monitoring the 
> decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
