[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218302#comment-15218302
 ] 

Nathan Roberts commented on HDFS-9239:
--------------------------------------

bq. However, making it lighter on the datanode side is a good idea. We have seen 
many cases where nodes are declared dead because the service actor thread is 
delayed/blocked. 

Just a quick update on this comment. Even after HDFS-7060, we still had cases 
where DataNodes would fail to heartbeat in. We eventually tracked this down to 
the RHEL CFQ I/O scheduler. In some situations, significant seek activity 
(like a massive shuffle) can cause this I/O scheduler to starve writers 
indefinitely. This eventually causes the DataNode and/or NodeManager processes 
to stop completely (probably due to logging I/O backing up). So, no matter how 
smart we make these daemons, they are going to be lost from the NN/RM point of 
view in these situations. That is probably the right outcome, though: these 
daemons are clearly unable to do their job, so they SHOULD be declared lost. 

In any event, the change we found most valuable for this situation was 
switching to the deadline I/O scheduler. This dramatically reduced the number 
of lost DataNodes and NodeManagers we were seeing.
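
For anyone who wants to check which scheduler their nodes are using, here is a 
minimal sketch (not something we ran as-is). It assumes a Linux host that 
exposes the active scheduler in /sys/block/<device>/queue/scheduler, with the 
active one shown in square brackets, e.g. "noop deadline [cfq]". The helper 
names active_scheduler and report_schedulers are made up for illustration, and 
newer kernels may report "mq-deadline" rather than "deadline".

```python
from pathlib import Path


def active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if absent."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def report_schedulers(sys_block=Path("/sys/block")):
    """Map each block device to its active I/O scheduler (Linux sysfs)."""
    result = {}
    for sched_file in sorted(sys_block.glob("*/queue/scheduler")):
        # Path layout assumed: /sys/block/<device>/queue/scheduler
        device = sched_file.parent.parent.name
        result[device] = active_scheduler(sched_file.read_text())
    return result


if __name__ == "__main__":
    for device, sched in report_schedulers().items():
        flag = "" if sched == "deadline" else "  <-- not deadline"
        print(f"{device}: {sched}{flag}")
```

Switching a device is typically done by writing the scheduler name to that 
same sysfs file as root. Whether the setting survives a reboot depends on how 
the host is configured, so treat this as a diagnostic aid only.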
 

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>             Fix For: 2.8.0
>
>         Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
