[
https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302025#comment-14302025
]
Chris Nauroth edited comment on HDFS-7604 at 2/2/15 10:05 PM:
--------------------------------------------------------------
I've done another mock-up of the UI. This version avoids adding clutter to the
existing Datanodes page and instead moves failure information to its own
dedicated page.
Just like in the existing screenshot 3, there is a new field on the summary for
Total Failed Volumes. I also intend to display lost capacity in parentheses
next to it. However, unlike last time, the existing Datanodes page is
unchanged. Instead, the volume failure information is on a new Datanode
Volumes page, shown in new screenshot 4. This is hyperlinked from both the
Total Failed Volumes field in the summary and a new tab in the top nav.
The new page has a table displaying only the DataNodes that have volume
failures. For each one, it displays the address, seconds since last contact,
time of last volume failure, number of failed volumes, estimated capacity lost
due to these volume failures, and a list of every failed storage location's
path. The capacity lost is only an estimate, because some edge cases can keep
us from reporting it accurately. For example, if a volume hits an I/O error
before we ever get a chance to check its capacity, then we cannot know how
much storage was available on that volume.
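The table described above implies a small per-DataNode record behind each row. Here is a minimal sketch of what that record might look like; the class and field names are my own illustration, not necessarily what the patch will use:

```java
// Illustrative sketch only: one row of the proposed Datanode Volumes table,
// i.e. a DataNode with at least one failed volume. Names are hypothetical.
public class VolumeFailureSummary {
    private final String address;                   // DataNode host:port
    private final long lastContactSecs;             // seconds since last heartbeat
    private final long lastVolumeFailureDate;       // ms since epoch; 0 if unknown
    private final String[] failedStorageLocations;  // paths of the failed volumes
    private final long estimatedCapacityLost;       // bytes; best-effort estimate

    public VolumeFailureSummary(String address, long lastContactSecs,
            long lastVolumeFailureDate, String[] failedStorageLocations,
            long estimatedCapacityLost) {
        this.address = address;
        this.lastContactSecs = lastContactSecs;
        this.lastVolumeFailureDate = lastVolumeFailureDate;
        this.failedStorageLocations = failedStorageLocations;
        this.estimatedCapacityLost = estimatedCapacityLost;
    }

    /** The failed volume count is just the number of failed locations. */
    public int getNumFailedVolumes() {
        return failedStorageLocations.length;
    }

    public String getAddress() { return address; }
    public String[] getFailedStorageLocations() { return failedStorageLocations; }
    public long getEstimatedCapacityLost() { return estimatedCapacityLost; }
}
```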
The end-user workflow I imagine is this: an admin first checks the summary
information and notices a non-zero failed volume count, then navigates to the
Datanode Volumes page for a list of volume failures across the cluster.
Because this view lists only the DataNodes with volume failures, the admin
won't need to scan through the master list looking for individual nodes with a
non-zero volume failure count. The page can act as a sort of work queue for
the admin recovering or replacing disks.
I have not updated the patch yet; I need to rework the heartbeat information
to provide this data to the UI. For now, Last Failure Time and Estimated
Capacity Lost are shown as TODO in the screenshot. Further feedback is welcome
while I continue coding the new patch.
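Once the heartbeat carries this data, the NameNode would presumably expose it through JMX for the web UI to render. A purely illustrative sketch of what one node's entry might look like; the key names and structure here are my guesses, not a final format:

```json
{
  "dn1.example.com:50010": {
    "lastContact": 2,
    "failedStorageLocations": ["/data/1/dfs/dn", "/data/3/dfs/dn"],
    "lastVolumeFailureDate": 1422914700000,
    "estimatedCapacityLostTotal": 4000000000
  }
}
```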
> Track and display failed DataNode storage locations in NameNode.
> ----------------------------------------------------------------
>
> Key: HDFS-7604
> URL: https://issues.apache.org/jira/browse/HDFS-7604
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png,
> HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604.001.patch,
> HDFS-7604.prototype.patch
>
>
> During heartbeats, the DataNode can report a list of its storage locations
> that have been taken out of service due to failure (such as due to a bad disk
> or a permissions problem). The NameNode can track these failed storage
> locations and then report them in JMX and the NameNode web UI.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)