[
https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302025#comment-14302025
]
Chris Nauroth edited comment on HDFS-7604 at 2/2/15 10:05 PM:
--------------------------------------------------------------
I've done another mock-up of the UI. This version avoids adding clutter to the
existing Datanodes page and instead moves failure information to its own
dedicated page.
Just like in the existing screenshot 3, there is a new field on the summary for
Total Failed Volumes. I also intend to display lost capacity in parentheses
next to it. However, unlike last time, the existing Datanodes page is
unchanged. Instead, the volume failure information is on a new Datanode
Volumes page, shown in new screenshot 4. This is hyperlinked from both the
Total Failed Volumes field in the summary and a new tab in the top nav.
The new page has a table displaying only the DataNodes that have volume
failures. For each one, it displays the address, seconds since last contact,
time of last volume failure, number of failed volumes, estimated capacity lost
due to these volume failures, and a list of every failed storage location's
path. The capacity lost is only an estimate, because some edge cases can keep
us from reporting it accurately. For example, if a volume hits an I/O error
before we ever get a chance to check its capacity, then we cannot know how
much storage was available on that volume.
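The table described above implies a small per-DataNode record behind each row. Here is a minimal sketch of what that record might look like; the class and field names are my own illustration, not necessarily what the patch will use:

```java
// Illustrative sketch only: one row of the proposed Datanode Volumes table,
// i.e. a DataNode with at least one failed volume. Names are hypothetical.
public class VolumeFailureSummary {
    private final String address;                   // DataNode host:port
    private final long lastContactSecs;             // seconds since last heartbeat
    private final long lastVolumeFailureDate;       // ms since epoch; 0 if unknown
    private final String[] failedStorageLocations;  // paths of the failed volumes
    private final long estimatedCapacityLost;       // bytes; best-effort estimate

    public VolumeFailureSummary(String address, long lastContactSecs,
            long lastVolumeFailureDate, String[] failedStorageLocations,
            long estimatedCapacityLost) {
        this.address = address;
        this.lastContactSecs = lastContactSecs;
        this.lastVolumeFailureDate = lastVolumeFailureDate;
        this.failedStorageLocations = failedStorageLocations;
        this.estimatedCapacityLost = estimatedCapacityLost;
    }

    /** The failed volume count is just the number of failed locations. */
    public int getNumFailedVolumes() {
        return failedStorageLocations.length;
    }

    public String getAddress() { return address; }
    public String[] getFailedStorageLocations() { return failedStorageLocations; }
    public long getEstimatedCapacityLost() { return estimatedCapacityLost; }
}
```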
The end-user workflow I imagine is this: an admin first checks the summary
information and notices a non-zero failed volume count, then navigates to the
Datanode Volumes page for a list of volume failures across the cluster.
Because this view lists only the DataNodes with volume failures, the admin
won't need to scan through the master list looking for individual nodes with a
non-zero volume failure count. The page can act as a sort of work queue for
the admin recovering or replacing disks.
I have not updated the patch yet; I need to rework the heartbeat information
to provide this data to the UI. For now, Last Failure Time and Estimated
Capacity Lost are shown as TODO in the screenshot. Further feedback is welcome
while I continue coding the new patch.
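Once the heartbeat carries this data, the NameNode would presumably expose it through JMX for the web UI to render. A purely illustrative sketch of what one node's entry might look like; the key names and structure here are my guesses, not a final format:

```json
{
  "dn1.example.com:50010": {
    "lastContact": 2,
    "failedStorageLocations": ["/data/1/dfs/dn", "/data/3/dfs/dn"],
    "lastVolumeFailureDate": 1422914700000,
    "estimatedCapacityLostTotal": 4000000000
  }
}
```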
> Track and display failed DataNode storage locations in NameNode.
> ----------------------------------------------------------------
>
> Key: HDFS-7604
> URL: https://issues.apache.org/jira/browse/HDFS-7604
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png,
> HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604.001.patch,
> HDFS-7604.prototype.patch
>
>
> During heartbeats, the DataNode can report a list of its storage locations
> that have been taken out of service due to failure (such as due to a bad disk
> or a permissions problem). The NameNode can track these failed storage
> locations and then report them in JMX and the NameNode web UI.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)