[
https://issues.apache.org/jira/browse/HDFS-6626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069811#comment-14069811
]
Ming Ma commented on HDFS-6626:
-------------------------------
Thanks, Andrew. I discussed this further with our admins; what they want is to
identify bad nodes quickly in the context of decommission. I agree that such a
new state doesn't help much, given that the dead nodes UI already provides this
information.
> Node is marked decommissioned if it becomes dead when it is being
> decommissioned
> --------------------------------------------------------------------------------
>
> Key: HDFS-6626
> URL: https://issues.apache.org/jira/browse/HDFS-6626
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
>
> Not sure if this is by design, but it isn't intuitive. The scenario is as
> follows: you try to decommission a node; while the node is being
> decommissioned, it becomes dead from the NN's point of view; right after that,
> the NN marks the node as decommissioned. Looking at the webUI, administrators
> will assume the decommission has completed successfully. This happens because
> decommission is considered done once there are no blocks left for the DN.
> {noformat}
> BlockManager.java
> boolean isReplicationInProgress(DatanodeDescriptor srcNode) {
>   boolean status = false;
>   ...
>   final Iterator<? extends Block> it = srcNode.getBlockIterator();
>   while (it.hasNext()) {
>     ...
>     // set status if there is block under replication
>   }
>   ...
>   return status;
> }
> {noformat}
> The question is whether we should mark the dead node as decommission
> completed (the current behavior), or mark the dead node "decommission
> aborted". From the administrators' point of view, when they are running a
> decomm, they want to know the status of the decomm and the health of the
> decomm-in-progress nodes. If they can detect a decommission failure earlier,
> they might be able to take action earlier; for example, if the TOR switch has
> issues during decomm, administrators will be able to quickly spot a group
> of "decommission aborted" nodes from the same rack. People can still find
> this information by joining the decomm node list with the recent dead
> node list on the webUI; it is just not as convenient.
> Suggestions?
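
For illustration, the join between the decomm node list and the dead node list
described above could be done offline along the following lines. This is only a
minimal sketch, not HDFS code: the class DecommJoin, the method
deadWhileDecommissioning, and the host and rack names are all hypothetical, and
it assumes the two lists have already been copied out of the webUI by hand.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of HDFS: joins two node lists taken from the webUI.
public class DecommJoin {

    // Groups by rack the hosts that are in the decommissioning list and
    // also appear in the dead node list.
    static Map<String, Set<String>> deadWhileDecommissioning(
            Map<String, String> decommissioningHostToRack, Set<String> deadHosts) {
        Map<String, Set<String>> byRack = new TreeMap<>();
        for (Map.Entry<String, String> e : decommissioningHostToRack.entrySet()) {
            if (deadHosts.contains(e.getKey())) {
                byRack.computeIfAbsent(e.getValue(), r -> new TreeSet<>())
                      .add(e.getKey());
            }
        }
        return byRack;
    }

    public static void main(String[] args) {
        // Illustrative data only; these hosts and racks are made up.
        Map<String, String> decommissioning = Map.of(
            "dn1", "/rack-a",
            "dn2", "/rack-a",
            "dn3", "/rack-b");
        Set<String> dead = Set.of("dn1", "dn2", "dn9");
        // prints {/rack-a=[dn1, dn2]}
        System.out.println(deadWhileDecommissioning(decommissioning, dead));
    }
}
```

Several dead nodes clustered under one rack in the result would hint at the
kind of rack-level problem (such as a TOR switch issue) mentioned in the
description.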
--
This message was sent by Atlassian JIRA
(v6.2#6252)