[ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211582#comment-14211582 ]
Andrew Wang commented on HDFS-7374:
-----------------------------------

This issue is definitely tricky. I agree with everyone's discussion thus far, thanks especially to [~mingma] for weighing in with insights from HDFS-6791.

I think Zhe's proposal #2 is good, and we should work on getting it in. As a follow-on, we could also consider trying to expose more information to operators to help them decide if they should "force decom" by messing with the exclude file. The core issue IIUC is knowing if a force decom will result in data loss, which could probably be pieced together from fsck, but is by no means cheap to do.

With that, some light patch comments:

* I think the logic is a bit wrong right now, since it can shortcut a node from (DEAD, DECOM_IN_PROGRESS) to (DEAD, DECOMMED) if refresh is called when the node is in the exclude file, where IIUC what we want is to only allow (DEAD, NORMAL) to (DEAD, DECOMMED).
* Because of the above, would be good to move this logic into startDecommission instead, we also want to be doing some log prints even in this situation
* We can use GenericTestUtils#waitFor to do the waitForDatanodeState, it prints a nice stack trace as a benefit.
* Could clean up the imports in TestDeadDatanode, TestDecommissioningStatus
* TestDecomm, could remove the println, line longer than 80chars, could also add a test timeout

Thanks for working on this Zhe!

> Allow decommissioning of dead DataNodes
> ---------------------------------------
>
>                 Key: HDFS-7374
>                 URL: https://issues.apache.org/jira/browse/HDFS-7374
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-7374-001.patch
>
>
> We have seen the use case of decommissioning DataNodes that are already dead
> or unresponsive, and not expected to rejoin the cluster.
> The logic introduced by HDFS-6791 will mark those nodes as
> {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish
> the decommission work.
> If an upper layer application is monitoring the
> decommissioning progress, it will hang forever.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
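
For reference, the transition rule from the first patch comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HDFS code: the class `DeadNodeDecomSketch`, the `AdminState` enum, and the `startDecommission(boolean, AdminState)` signature are all hypothetical names made up for this example, and the behavior for live nodes is assumed to be the usual move to in-progress.

```java
// Hypothetical sketch only: these names do not correspond to real HDFS classes.
final class DeadNodeDecomSketch {

    // Simplified admin states; the comment's DECOMMED maps to DECOMMISSIONED here.
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /**
     * Returns the admin state a node should move to when decommissioning
     * is requested for it (for example, after an exclude-file refresh).
     *
     * @param alive   whether the node is currently considered alive
     * @param current the node's current admin state
     */
    static AdminState startDecommission(boolean alive, AdminState current) {
        if (current == AdminState.DECOMMISSIONED) {
            // Already done; nothing changes.
            return current;
        }
        if (!alive && current == AdminState.NORMAL) {
            // The only shortcut the comment wants: (DEAD, NORMAL) -> (DEAD, DECOMMED).
            // Still log, as the review asks for log prints in this situation.
            System.out.println("Dead node in NORMAL state, marking DECOMMISSIONED");
            return AdminState.DECOMMISSIONED;
        }
        if (!alive) {
            // (DEAD, DECOM_IN_PROGRESS) must NOT be shortcut to DECOMMED.
            System.out.println("Dead node already decommissioning, leaving in progress");
            return AdminState.DECOMMISSION_INPROGRESS;
        }
        // Assumption: a live node enters (or stays in) the in-progress state.
        System.out.println("Live node, decommission in progress");
        return AdminState.DECOMMISSION_INPROGRESS;
    }
}
```

Keeping the whole decision in one method mirrors the suggestion to move the logic into startDecommission, so a refresh of the exclude file cannot bypass it.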