Improved usability around node decommissioning and block replication on
dfshealth.jsp
-------------------------------------------------------------------------------------
Key: HDFS-2849
URL: https://issues.apache.org/jira/browse/HDFS-2849
Project: Hadoop HDFS
Issue Type: New Feature
Components: documentation, name-node
Affects Versions: 0.20.2
Reporter: Jeff Bean
When you do this:
- Decommission a single node.
- The under-replicated count reports all blocks.
- Stop the decommission.
- The under-replicated count decreases slowly and heads to 0.
This is expected HDFS behavior, but while it is happening, utilities like
dfshealth.jsp and fsck report high numbers of under-replicated blocks, and the
node is not on the dead/decommissioned nodes list. It is therefore unclear to
novice administrators whether or not this is a failure condition that needs
administrative attention.
Administrators find themselves constantly having to explain the
under-replication number when they could be doing better things with their
time. They also constantly get alarms that can be disregarded, raising
fears of a "cry wolf" problem in which a real issue gets lost in the noise.
A direct quote from such an administrator:
"When a datanode fails, it's not considered a 'decommissioning', so it does not
show up in that list, it just simply kicks on the underrep and we have to hunt
through the LIVE list and attempt to find out which node caused the issue.
Obviously, we (the community) are not being told on the DEAD list when a node
appears (why this information has to be withheld has always been an issue with
me, how hard is it to put a date field in the DEAD list?)"
Nevertheless, we should have more information about a dying node instead of
seeing a jump in the under-replicated count from 0 to millions with no obvious
reason. Perhaps add another column saying 'DYING NODE'; anything would help.
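
Until the UI exposes more context, an alerting script could at least separate
the expected case from the unexplained one. The following is only a rough
sketch, not part of HDFS: it assumes the fsck summary contains a line of the
form "Under-replicated blocks: <count>", and the function names and the
recent_decommissions list (hostnames the operator tracks outside HDFS) are
hypothetical.

```python
import re

# Matches the summary line printed by fsck, e.g.
# " Under-replicated blocks:\t1200 (40.0 %)". The exact layout is assumed.
UNDER_REP = re.compile(r"Under-replicated blocks:\s*(\d+)")


def under_replicated_count(fsck_output):
    """Return the under-replicated block count from fsck text, or None."""
    match = UNDER_REP.search(fsck_output)
    return int(match.group(1)) if match else None


def classify(count, recent_decommissions, threshold=0):
    """Label an under-replication reading for an alerting system.

    recent_decommissions: hypothetical list of hostnames on which the
    operator recently started or stopped a decommission.
    """
    if count is None:
        return "unknown"
    if count <= threshold:
        return "ok"
    if recent_decommissions:
        return "expected"   # re-replication following a decommission
    return "investigate"    # no known cause; possibly a failed datanode
```

An alert would then page someone only for "investigate", which addresses the
"cry wolf" concern without hiding the count itself.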