[ https://issues.apache.org/jira/browse/HDFS-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226850#comment-13226850 ]
Harsh J commented on HDFS-2849:
-------------------------------
The central trouble here is that some ops care about the global
under-replicated count, which is fair, as there isn't a more granular option.
This is a finding-signal-in-noise issue. We'll need to split the count into
finer-grained metrics so that ops can choose which ones to care about.
If there aren't metrics today for the decommissioning-blocks count, we can add
them, and those who wish to keep monitoring the global under-replicated count
can subtract the decommissioning-pending block count from it and be done?
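To illustrate the subtraction suggested above, here is a minimal sketch of how
a monitoring script might derive an "actionable" count. The function names,
parameter names, and the alert threshold are hypothetical; they do not refer to
any existing HDFS metric or API, and a decommissioning-pending metric is
assumed to have been added as proposed.

```python
def actionable_under_replicated(total_under_replicated: int,
                                decommissioning_pending: int) -> int:
    """Under-replicated blocks that are not explained by decommissioning.

    Both inputs are assumed to be block counts read from the NameNode's
    metrics; how they are fetched is left out of this sketch.
    """
    if total_under_replicated < 0 or decommissioning_pending < 0:
        raise ValueError("block counts must be non-negative")
    # Clamp at zero: the two counters may be sampled at slightly different
    # moments, so the difference could briefly go negative.
    return max(0, total_under_replicated - decommissioning_pending)


def should_alert(total_under_replicated: int,
                 decommissioning_pending: int,
                 threshold: int = 0) -> bool:
    """Alert only when the non-decommissioning portion exceeds the threshold."""
    remaining = actionable_under_replicated(total_under_replicated,
                                            decommissioning_pending)
    return remaining > threshold
```

With this, a decommission that accounts for every under-replicated block would
raise no alarm, while any remainder beyond the threshold still would.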
> Improved usability around node decommissioning and block replication on
> dfshealth.jsp
> -------------------------------------------------------------------------------------
>
> Key: HDFS-2849
> URL: https://issues.apache.org/jira/browse/HDFS-2849
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: documentation, name-node
> Affects Versions: 0.20.2
> Reporter: Jeff Bean
>
> When you do this:
> - Decom a single node.
> - Underreplicated count reports all blocks.
> - Stop decom.
> - Underreplication count reduces slowly and heads to 0.
> This is expected behavior of HDFS but while this is happening, utilities like
> dfshealth.jsp and fsck produce high numbers of underreplicated blocks, and
> the node is not on the dead/decommissioned nodes list. It's therefore unclear
> to novice administrators and HDFS newbies whether or not this is a failure
> condition that needs administrative attention.
> Administrators find themselves constantly having to explain the
> under-replication number when they could be doing better things with their
> time. And they're constantly getting alarms which can be disregarded, raising
> fears of a "cry wolf" problem that the real issue gets lost in the noise.
> A direct quote from such an administrator:
> "When a datanode fails, it's not considered a 'decommissioning', so it does
> not show up in that list, it just simply kicks on the underrep and we have to
> hunt through the LIVE list and attempt to find out which node caused the
> issue. Obviously, we (the community) are not being told on the DEAD list when
> a node appears (why this information has to be withheld has always been an
> issue with me, how hard is it to put a date field in the DEAD list?)"
> Nevertheless, we should have more information about a dying node instead of
> seeing a jump in the underrep count from 0 to millions with no real obvious
> reason. Perhaps add another column saying 'DYING NODE', anything would help.