[ 
https://issues.apache.org/jira/browse/HDFS-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196276#comment-13196276
 ] 

Suresh Srinivas commented on HDFS-2849:
---------------------------------------

bq. This is expected behavior of HDFS but while this is happening, utilities 
like dfshealth.jsp and fsck produce high numbers of underreplicated blocks, and 
the node is not on the dead/decommissioned nodes list. It's therefore unclear 
to novice administrators and HDFS newbies whether or not this is a failure 
condition that needs administrative attention. 

I am not sure I understand the description. 

Under-replicated blocks can be due either to a datanode being dead or to decommissioning being in progress. If it is due to a dead datanode, the node appears in the dead list. If it is due to decommissioning, the 1.0 release lists the decommissioning nodes and has a separate page showing decommission status; the datanode should appear in the decommissioning list.
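As a quick way to tell which case applies, the standard Hadoop 1.x CLI shows both views. This is a sketch using the stock `dfsadmin` and `fsck` commands; the exact output labels can vary between versions:

```shell
# Report per-datanode state as the namenode sees it. Dead nodes and nodes
# with "Decommission Status : Decommission in progress" are both listed here.
hadoop dfsadmin -report

# Summarize filesystem block health, including the count of
# under-replicated blocks across the namespace.
hadoop fsck /
```

If the under-replicated count is high but `dfsadmin -report` shows no dead nodes and no nodes in progress of decommissioning, then something else is going on and the report output is the place to start.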


bq. Nevertheless, we should have more information about a dying node instead of 
seeing a jump in the underrep count from 0 to millions with no real obvious 
reason. Perhaps add another column saying 'DYING NODE', anything would help.

I am not sure what you mean by "DYING NODE". The node should already be dead as far as the namenode is concerned for its blocks to be considered under-replicated.

                
> Improved usability around node decommissioning and block replication on 
> dfshealth.jsp
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-2849
>                 URL: https://issues.apache.org/jira/browse/HDFS-2849
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: documentation, name-node
>    Affects Versions: 0.20.2
>            Reporter: Jeff Bean
>
> When you do the following:
>     - Decommission a single node.
>     - The under-replicated count jumps to include all of that node's blocks.
>     - Stop the decommission.
>     - The under-replicated count slowly decreases back toward 0.
> This is expected behavior of HDFS but while this is happening, utilities like 
> dfshealth.jsp and fsck produce high numbers of underreplicated blocks, and 
> the node is not on the dead/decommissioned nodes list. It's therefore unclear 
> to novice administrators and HDFS newbies whether or not this is a failure 
> condition that needs administrative attention. 
> Administrators find themselves constantly having to explain the 
> under-replication number when they could be doing better things with their 
> time. And they constantly get alarms that can be disregarded, raising 
> fears of a "cry wolf" problem in which the real issue gets lost in the noise.
> A direct quote from such an administrator:
> "When a datanode fails, it's not considered a 'decommissioning', so it does 
> not show up in that list, it just simply kicks on the underrep and we have to 
> hunt through the LIVE list and attempt to find out which node caused the 
> issue. Obviously, we (the community) are not being told on the DEAD list when 
> a node appears (why this information has to be withheld has always been an 
> issue with me, how hard is it to put a date field in the DEAD list?)"
> Nevertheless, we should have more information about a dying node instead of 
> seeing a jump in the underrep count from 0 to millions with no real obvious 
> reason. Perhaps add another column saying 'DYING NODE', anything would help.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
