[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001314#comment-15001314 ]
Daryn Sharp commented on HDFS-9239:
-----------------------------------
There are really two problems to solve here: ensuring the DN can actually
send its heartbeat, as Kihwal alluded to, and ensuring the NN can process it
in a reasonable time.
In the DN, the main causes of the DN jamming up and not sending heartbeats
were: 1) commands (finalize) were not handled asynchronously; 2) collecting
the du/df metrics for the heartbeat blocked because the block layout change
paralyzed the disks. Although finalize is now async, in the more general
sense heartbeat response commands should always be decoupled from the sending
of the heartbeat.
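To make the decoupling concrete, here is a minimal Java sketch of the idea, not the actual DataNode code. The class name, method names, and the use of plain strings as commands are all hypothetical; the only point is that the heartbeat thread hands response commands to a separate worker and returns immediately.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: commands from a heartbeat response are queued to a
// dedicated worker thread so the heartbeat loop never waits on them.
public class CommandDecouplingSketch {
    // Single worker thread: commands still run in arrival order.
    private final ExecutorService commandExecutor =
        Executors.newSingleThreadExecutor();
    private final List<String> processed = new CopyOnWriteArrayList<>();

    // Called on the heartbeat thread; only enqueues, so it returns quickly
    // even if an earlier command (e.g. finalize) is still running.
    public void onHeartbeatResponse(List<String> commands) {
        for (String cmd : commands) {
            commandExecutor.execute(() -> process(cmd));
        }
    }

    // Runs on the worker thread; slow disk work would happen here.
    private void process(String cmd) {
        processed.add(cmd);
    }

    public List<String> processedCommands() {
        return processed;
    }

    // Stops accepting commands and waits for queued ones to finish.
    public boolean shutdownAndWait(long timeoutMillis) {
        commandExecutor.shutdown();
        try {
            return commandExecutor.awaitTermination(
                timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, a slow command delays later commands but never the next heartbeat.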
On the NN, the fsn lock could be a problem, but in practice we've not seen
one even with over 5k nodes. Still, I really like the approach of making the
heartbeat stat updates fsn-lockless. Collecting the commands without the lock
(since it doubles as an operational state lock) isn't trivial, or you would
have done that. What you could do after the stat update is use tryLock with a
short timeout. If you can't get the lock, oh well, this heartbeat response
doesn't carry any commands.
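A rough Java sketch of the lockless + tryLock idea, under stated assumptions: the class, its fields, and the string commands are hypothetical stand-ins, and a plain ReentrantLock plays the role of the fsn lock. It is not the NameNode implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of heartbeat handling: stats are updated without the
// lock, and commands are collected only if the lock is acquired quickly.
public class HeartbeatSketch {
    // Stand-in for the namesystem (fsn) lock.
    private final ReentrantLock fsnLock = new ReentrantLock();
    // Liveness stat updated without taking fsnLock.
    private final AtomicLong lastHeartbeatMillis = new AtomicLong();
    // Pending commands for the DN; guarded by fsnLock.
    private final List<String> pendingCommands = new ArrayList<>();

    public void queueCommand(String cmd) {
        fsnLock.lock();
        try {
            pendingCommands.add(cmd);
        } finally {
            fsnLock.unlock();
        }
    }

    public long lastHeartbeatMillis() {
        return lastHeartbeatMillis.get();
    }

    public List<String> handleHeartbeat(long nowMillis, long lockWaitMillis) {
        // Step 1: lockless stat update, so liveness is always recorded.
        lastHeartbeatMillis.set(nowMillis);
        // Step 2: wait only briefly for the lock; on timeout, reply with
        // no commands and let a later heartbeat pick them up.
        boolean locked;
        try {
            locked = fsnLock.tryLock(lockWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.emptyList();
        }
        if (!locked) {
            return Collections.emptyList();
        }
        try {
            List<String> cmds = new ArrayList<>(pendingCommands);
            pendingCommands.clear();
            return cmds;
        } finally {
            fsnLock.unlock();
        }
    }
}
```

Commands skipped under contention stay queued, so they are delayed to a later heartbeat rather than lost.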
I'm not sure we need yet another RPC server for this purpose. NN heartbeat
processing with a lockless + tryLock implementation would make it ideally
suited for the existing client and/or service servers.
> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode
> liveness
> -----------------------------------------------------------------------------------
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline
> Protocol. This is an RPC protocol that is responsible for reporting liveness
> and basic health information about a DataNode to a NameNode. Compared to the
> existing heartbeat messages, it is lightweight and not prone to resource
> contention problems that can harm accurate tracking of DataNode liveness
> currently. The attached design document contains more details.