[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218470#comment-15218470
 ] 

Nathan Roberts commented on HDFS-9239:
--------------------------------------

bq. Just to make sure I'm clear, are you talking about configuring the deadline 
scheduler as described here?

Yes, those links are talking about the right parameters. 

We currently run with read_expire=1000, write_expire=1000, and 
writes_starved=1. Since our I/O workloads change dramatically over time, we 
didn't spend a lot of time looking for optimal values here. These have been 
working well for the last several months across multiple clusters.
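
A minimal sketch of how the values above could be applied, assuming the usual 
Linux sysfs layout (/sys/block/<device>/queue/...). The helper name, the 
sysfs_root parameter, and the device name "sda" are illustrative assumptions, 
not part of the setup described in this comment. On newer multi-queue kernels 
the scheduler is typically named "mq-deadline" rather than "deadline". Writing 
to the real sysfs tree requires root.

```python
import os

# Values quoted in the comment above.
DEADLINE_TUNABLES = {"read_expire": 1000, "write_expire": 1000, "writes_starved": 1}


def apply_deadline_settings(device, sysfs_root="/sys/block", scheduler="deadline"):
    """Select the given scheduler for `device`, then write the tunables.

    Hypothetical helper: `sysfs_root` is a parameter only so the function
    can be exercised against a scratch directory instead of real sysfs.
    Returns a dict mapping each tunable's path to the value written.
    """
    queue_dir = os.path.join(sysfs_root, device, "queue")

    # The scheduler must be selected first; on a real system the iosched/
    # directory only exposes these tunables once deadline is active.
    with open(os.path.join(queue_dir, "scheduler"), "w") as f:
        f.write(scheduler)

    written = {}
    for name, value in DEADLINE_TUNABLES.items():
        path = os.path.join(queue_dir, "iosched", name)
        with open(path, "w") as f:
            f.write(str(value))
        written[path] = value
    return written
```

For example, apply_deadline_settings("sda") would be run once per data disk. 
Settings written this way do not survive a reboot, so they would normally be 
reapplied at boot time.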

As an aside, a relatively easy way to reproduce this problem is to put a heavy 
seek load on all the disks of a datanode (e.g. 
http://www.linuxinsight.com/how_fast_is_your_disk.html; I believe 5-10 copies 
of seeker were sufficient). After a minute or so, the system becomes almost 
unusable and the datanode will be declared lost. This might be a good test to 
run against the lifeline protocol. My hunch is that, with CFQ, the datanode 
will still be lost. 
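
A rough sketch of a seek-heavy load generator in the spirit of the test 
described above. This is not the seeker tool from the linked page: the function 
name, its parameters, and the 512-byte block size are assumptions made for 
illustration. Reads go through the page cache, so the target should be a block 
device or a file much larger than RAM for the load to actually reach the disk.

```python
import os
import random
import time


def random_seek_load(path, duration_s=30.0, block_size=512, seed=None):
    """Read `block_size` bytes at random block-aligned offsets in `path`
    until `duration_s` seconds have elapsed. Returns the number of reads.

    Hypothetical helper for illustration; requires a Unix-like OS
    (os.pread) and read permission on `path`.
    """
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size of a file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        blocks = size // block_size
        if blocks == 0:
            raise ValueError("target is smaller than one block")

        deadline = time.monotonic() + duration_s
        reads = 0
        while time.monotonic() < deadline:
            offset = rng.randrange(blocks) * block_size
            os.pread(fd, block_size, offset)
            reads += 1
        return reads
    finally:
        os.close(fd)
```

To approximate the scenario above, several copies of this would be started 
per disk (for instance one process per copy), each pointed at a different 
data disk of the datanode.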

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>             Fix For: 2.8.0
>
>         Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



