[ https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Nauroth updated HDFS-7604:
--------------------------------
Attachment: HDFS-7604.001.patch
The attached patch implements the feature. Summary:
* The protocol definition for heartbeat requests has been changed to add
{{failedStorageLocations}}, which contains multiple strings used to report the
local file system path of each failed storage.
* The DN calculates its failed storage locations as the set difference between
everything configured in {{dfs.datanode.data.dir}} and the current live volumes
in use by the {{FsDatasetImpl}}. Doing it this way works well with the live DN
reconfiguration feature (HDFS-6808), because it will use the current active
configuration rather than what was loaded at process start time.
* The failed storage locations are exposed through {{FSDatasetMBean}}, so the
metrics on an individual DN will publish that information. I also updated
{{FsDatasetImpl#getNumFailedVolumes}} to keep its implementation in sync with
the new method.
* {{FsVolumeList}} no longer needs to track a separate counter of failed
volumes. As a side effect, I believe this fixes a potential bug with live
DN reconfiguration. (If a previously failed volume were brought back online
through live reconfiguration, I don't believe this counter would have been
decremented or reset to reflect the new state.)
* On the NN side, the heartbeat handling now updates its data structures to
keep track of the failed storage locations per DN.
* The failed storage locations for all DNs are exposed through
{{FSNamesystemMBean}}. There is also a new counter for the total volume
failures across all DNs.
* The web UI templates have been updated to display the new data.
* {{TestDataNodeVolumeFailureReporting}} contains the tests for this feature.
I took the opportunity to do a few other minor cleanups in this file.
* Numerous other test files contain minor changes to deal with method signature
changes related to passing the new field in the heartbeat.
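
To illustrate the set-difference calculation on the DN side described above, here is a minimal sketch. It is not the code from HDFS-7604.001.patch: the class {{FailedStorageLocationsSketch}}, the method {{failedLocations}}, and the use of plain path strings are assumptions made for illustration only.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; names and types are illustrative, not taken from the patch.
final class FailedStorageLocationsSketch {
  private FailedStorageLocationsSketch() {}

  /**
   * Returns the configured storage locations that are not among the live
   * volumes, in configuration order.
   *
   * @param configuredDirs paths currently configured (for example, the
   *        active value of dfs.datanode.data.dir)
   * @param liveVolumes paths of the volumes currently in use
   */
  static Set<String> failedLocations(Collection<String> configuredDirs,
      Collection<String> liveVolumes) {
    // Start from everything configured, then drop whatever is still live.
    Set<String> failed = new LinkedHashSet<>(configuredDirs);
    failed.removeAll(liveVolumes);
    return failed;
  }
}
```

Because the result is recomputed from the current configuration and the current live volumes each time, a volume that comes back through live reconfiguration simply drops out of the set, with no separate counter to decrement.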
> Track and display failed DataNode storage locations in NameNode.
> ----------------------------------------------------------------
>
> Key: HDFS-7604
> URL: https://issues.apache.org/jira/browse/HDFS-7604
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png,
> HDFS-7604-screenshot-3.png, HDFS-7604.001.patch, HDFS-7604.prototype.patch
>
>
> During heartbeats, the DataNode can report a list of its storage locations
> that have been taken out of service because of a failure (such as a bad disk
> or a permissions problem). The NameNode can track these failed storage
> locations and then report them in JMX and the NameNode web UI.
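
The NameNode-side tracking described in the issue summary could be sketched roughly as follows. This is an illustrative assumption, not the actual NameNode data structure: the class {{NameNodeFailedStorageSketch}}, its methods, and the use of a string DataNode identifier are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real NameNode bookkeeping differs.
final class NameNodeFailedStorageSketch {
  // Latest failed storage locations reported by each DataNode.
  private final Map<String, List<String>> failedByDataNode = new HashMap<>();

  /** Replaces the stored list for this DataNode with the latest report. */
  synchronized void onHeartbeat(String dataNodeId,
      List<String> failedStorageLocations) {
    failedByDataNode.put(dataNodeId, new ArrayList<>(failedStorageLocations));
  }

  /** Returns the failed locations last reported by one DataNode. */
  synchronized List<String> failedLocations(String dataNodeId) {
    List<String> failed = failedByDataNode.get(dataNodeId);
    return failed == null
        ? Collections.<String>emptyList()
        : Collections.unmodifiableList(new ArrayList<>(failed));
  }

  /** Returns the total number of failed volumes across all DataNodes. */
  synchronized int totalVolumeFailures() {
    int total = 0;
    for (List<String> failed : failedByDataNode.values()) {
      total += failed.size();
    }
    return total;
  }
}
```

Replacing the per-DataNode list on every heartbeat, rather than accumulating, keeps the total consistent when a DataNode later reports fewer failures.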