[ https://issues.apache.org/jira/browse/HDFS-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226850#comment-13226850 ]

Harsh J commented on HDFS-2849:
-------------------------------

The central trouble here is that some operations care about the global 
under-replicated count, which is fair, since there is no more granular option 
today. This is a signal-in-noise problem: we need to split the metric into 
finer-grained counts that monitors can then choose among.

If there is no metric today for the count of blocks pending replication due to 
decommissioning, we can add one. Those who wish to keep monitoring the global 
under-replicated count can then subtract the decommissioning-pending block 
count from it and be done.
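The subtraction idea above can be sketched as a small monitoring helper. Note the metric names are assumptions for illustration: UnderReplicatedBlocks mirrors the existing FSNamesystem counter, while DecommissioningOnlyBlocks stands in for the proposed (not yet existing) decommissioning-pending counter.

```python
def adjusted_under_replicated(metrics):
    """Global under-replicated count minus blocks that are under-replicated
    only because their replicas live on decommissioning nodes.

    'UnderReplicatedBlocks' mirrors the existing FSNamesystem metric;
    'DecommissioningOnlyBlocks' is the hypothetical new metric proposed
    in this comment. Falls back to the raw count if the new metric is
    absent, so existing dashboards keep working unchanged.
    """
    raw = metrics["UnderReplicatedBlocks"]
    decommissioning = metrics.get("DecommissioningOnlyBlocks", 0)
    return raw - decommissioning

# Example: a decommissioning node holds 499,000 of the 500,000 blocks
# reported as under-replicated; only 1,000 warrant an alarm.
sample = {"UnderReplicatedBlocks": 500000, "DecommissioningOnlyBlocks": 499000}
print(adjusted_under_replicated(sample))  # prints 1000
```

An alerting system could then page on the adjusted number while still graphing the raw one, which addresses the "cry wolf" complaint in the issue description below.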
                
> Improved usability around node decommissioning and block replication on 
> dfshealth.jsp
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-2849
>                 URL: https://issues.apache.org/jira/browse/HDFS-2849
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: documentation, name-node
>    Affects Versions: 0.20.2
>            Reporter: Jeff Bean
>
> When you do this:
>     - Decommission a single node.
>     - The under-replicated count jumps to include all of that node's blocks.
>     - Stop the decommission.
>     - The under-replicated count drains slowly back toward 0.
> This is expected behavior of HDFS but while this is happening, utilities like 
> dfshealth.jsp and fsck produce high numbers of underreplicated blocks, and 
> the node is not on the dead/decommissioned nodes list. It's therefore unclear 
> to novice administrators and HDFS newbies whether or not this is a failure 
> condition that needs administrative attention. 
> Administrators find themselves constantly having to explain the 
> under-replication number when they could be doing better things with their 
> time. And they're constantly getting alarms that can be disregarded, raising 
> fears of a "cry wolf" problem in which a real issue gets lost in the noise.
> A direct quote from such an administrator:
> "When a datanode fails, it's not considered a 'decommissioning', so it does 
> not show up in that list, it just simply kicks on the underrep and we have to 
> hunt through the LIVE list and attempt to find out which node caused the 
> issue. Obviously, we (the community) are not being told on the DEAD list when 
> a node appears (why this information has to be withheld has always been an 
> issue with me, how hard is it to put a date field in the DEAD list?)"
> Nevertheless, we should have more information about a dying node instead of 
> seeing a jump in the underrep count from 0 to millions with no real obvious 
> reason. Perhaps add another column saying 'DYING NODE', anything would help.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
