[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218302#comment-15218302 ]
Nathan Roberts commented on HDFS-9239:
--------------------------------------
bq. However, making it lighter on the datanode side is a good idea. We have seen
many cases where nodes are declared dead because the service actor thread is
delayed/blocked.
Just a quick update on this comment. Even after HDFS-7060 we still had cases
where Datanodes would fail to heartbeat in. We eventually tracked this down to
the RHEL CFQ I/O scheduler. There are situations where significant seek
activity (like a massive shuffle) can cause this I/O scheduler to indefinitely
starve writers. This eventually causes the datanode and/or nodemanager
processes to completely stop (probably due to logging I/O backing up). So, no
matter how smart we make these daemons, they are going to be lost from the
NN/RM point of view in these situations. But this is probably the right thing
to do in these cases: these daemons are clearly not able to do their job, so
they SHOULD be declared lost.
In any event, the change which we found most valuable for this situation was to
switch to the deadline I/O scheduler. This dramatically reduced the number of
lost datanodes and nodemanagers we were seeing.
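
For anyone who wants to check which scheduler a node's disks are using, here is
a minimal sketch in Python. It assumes the standard Linux sysfs layout
(/sys/block/<device>/queue/scheduler), where the active scheduler is shown in
square brackets, e.g. "noop deadline [cfq]". The function names and the default
device name "sda" are illustrative only, and this is not necessarily how we
made the change on our clusters. Writing to the sysfs file requires root and
does not persist across reboots; newer kernels may list different scheduler
names (for example mq-deadline).

```python
import re
import sys
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with [brackets] in the sysfs file contents."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    if match is None:
        # Some devices list no bracketed entry at all; treat that as an error.
        raise ValueError(f"no active scheduler marked in {sysfs_text!r}")
    return match.group(1)


def scheduler_path(device: str) -> Path:
    """Build the sysfs path for a block device such as 'sda' (hypothetical default)."""
    return Path("/sys/block") / device / "queue" / "scheduler"


def set_scheduler(device: str, name: str = "deadline") -> None:
    """Select a scheduler for one device. Needs root; lost on reboot."""
    scheduler_path(device).write_text(name)


if __name__ == "__main__":
    device = sys.argv[1] if len(sys.argv) > 1 else "sda"
    current = active_scheduler(scheduler_path(device).read_text())
    print(f"{device}: {current}")
```

Running the script with a device name prints that device's active scheduler.
Calling set_scheduler() has to be done separately, as root, for each data disk.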
> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode
> liveness
> -----------------------------------------------------------------------------------
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch,
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline
> Protocol. This is an RPC protocol that is responsible for reporting liveness
> and basic health information about a DataNode to a NameNode. Compared to the
> existing heartbeat messages, it is lightweight and not prone to resource
> contention problems that can harm accurate tracking of DataNode liveness
> currently. The attached design document contains more details.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)