Ming Ma created HDFS-6626:
-----------------------------

             Summary: Node is marked decommissioned if it becomes dead when it is being decommissioned
                 Key: HDFS-6626
                 URL: https://issues.apache.org/jira/browse/HDFS-6626
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Ming Ma


Not sure if this is by design, but it isn't intuitive. The scenario is: you start decommissioning a node; while the node is being decommissioned, it becomes dead from the NN's point of view; right after that, the NN marks the node decommissioned. On the webUI, administrators will conclude that the decommission completed successfully. This happens because decommission is considered done as soon as there are no blocks left to track for the DN, and a dead node has no tracked blocks left.

{noformat}
BlockManager.java
  boolean isReplicationInProgress(DatanodeDescriptor srcNode) {
    boolean status = false;
...
    final Iterator<? extends Block> it = srcNode.getBlockIterator();
    while(it.hasNext()) {
...
      // set status if there is a block still under replication
    }
...
    return status;
  }
{noformat}
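
To make the effect concrete, here is a minimal, self-contained sketch of the flow (illustration only; the class and field names below are made up and this is not the actual HDFS code path): once the NN declares the DN dead and stops tracking its blocks, the block iterator is empty, isReplicationInProgress() returns false, and the node is moved straight to DECOMMISSIONED.

{noformat}
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative model only; names here are hypothetical, not HDFS internals.
public class DeadNodeDecommissionSketch {

  enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

  static class FakeDatanode {
    AdminState adminState = AdminState.DECOMMISSION_INPROGRESS;
    boolean dead = false;
    // Blocks the NN still tracks for this DN; cleared when the DN is declared dead.
    List<String> trackedBlocks;

    FakeDatanode(List<String> blocks) { this.trackedBlocks = blocks; }

    void markDead() {
      dead = true;
      trackedBlocks = Collections.emptyList(); // NN drops the dead DN's blocks
    }

    Iterator<String> getBlockIterator() { return trackedBlocks.iterator(); }
  }

  // Same shape as BlockManager#isReplicationInProgress: status can only become
  // true inside the loop, so an empty iterator means "decommission done".
  static boolean isReplicationInProgress(FakeDatanode dn) {
    boolean status = false;
    Iterator<String> it = dn.getBlockIterator();
    while (it.hasNext()) {
      it.next();
      status = true; // pretend every remaining block still needs replication
    }
    return status;
  }

  public static void main(String[] args) {
    FakeDatanode dn = new FakeDatanode(Arrays.asList("blk_1", "blk_2"));
    dn.markDead(); // node dies mid-decommission; its blocks are no longer tracked

    // Current behavior: nothing left to replicate, so the node is marked
    // DECOMMISSIONED even though it is dead.
    if (!isReplicationInProgress(dn)) {
      dn.adminState = AdminState.DECOMMISSIONED;
    }
    System.out.println("dead=" + dn.dead + ", adminState=" + dn.adminState);
    // prints: dead=true, adminState=DECOMMISSIONED
  }
}
{noformat}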

The question is whether we should mark the dead node as decommission completed (the current behavior) or mark it "decommission aborted". From the administrators' point of view, while a decommission is running they want to know both its progress and the health of the decommission-in-progress nodes. If they can detect a decommission failure earlier, they can take action earlier; for example, if a TOR switch has issues during decommission, administrators would quickly see a bunch of "decommission aborted" nodes from the same rack. Today they can still get this information by joining the decommissioned node list with the recent dead node list on the webUI; it is just not as convenient.
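
For reference, the manual "join" described above can also be scripted against the client API rather than eyeballed on the webUI. A rough sketch (the NameNode URI is a placeholder; this assumes DistributedFileSystem#getDataNodeStats and DatanodeInfo#getAdminState, which under the current behavior would only ever report such a node as DECOMMISSIONED):

{noformat}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

// Sketch: list dead datanodes together with their admin state, i.e. the
// "dead list joined with decomm list" an administrator would do by hand.
public class DeadDecommNodes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hdfs://nn-host:8020 is a placeholder NameNode URI
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://nn-host:8020"), conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      for (DatanodeInfo dn : dfs.getDataNodeStats(DatanodeReportType.DEAD)) {
        // With the current behavior, a node that died mid-decommission shows
        // up here as DECOMMISSIONED rather than as an aborted decommission.
        System.out.println(dn.getHostName() + " -> " + dn.getAdminState());
      }
    }
  }
}
{noformat}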

Suggestions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
