[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001314#comment-15001314
 ] 

Daryn Sharp commented on HDFS-9239:
-----------------------------------

There are really 2 problems to solve here: ensuring the DN can actually 
heartbeat, as Kihwal alluded to, and ensuring the NN can process the heartbeat 
in a reasonable time.

In the DN, our main problems with the DN jamming up and not sending heartbeats 
were: 1) commands (finalize) not being handled async, and 2) getting the du/df 
metrics for the heartbeat blocking because the block layout change paralyzed 
the disks.  Although finalize is now async, in the more general sense heartbeat 
response commands should always be decoupled from the sending of the heartbeat.
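
To illustrate the decoupling, here is a minimal sketch and not the actual DN 
code.  The class name, the method names, and the use of plain strings for 
commands are all hypothetical.  The heartbeat path only enqueues the commands 
from the response; a separate worker thread is assumed to drain the queue, so a 
slow command can never delay the next heartbeat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names and types are illustrative, not Hadoop APIs.
class CommandDecouplingSketch {
    // Commands received in heartbeat responses, waiting to be executed.
    private final BlockingQueue<String> commandQueue = new LinkedBlockingQueue<>();
    // Record of executed commands, standing in for the real work.
    private final List<String> executed = new ArrayList<>();

    // Called on the heartbeat thread: only enqueues, never runs a command.
    void onHeartbeatResponse(List<String> commands) {
        commandQueue.addAll(commands);
    }

    // Called in a loop by a dedicated worker thread.
    // Returns false if no command arrived within the timeout.
    boolean processNext(long timeoutMillis) {
        try {
            String cmd = commandQueue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (cmd == null) {
                return false;
            }
            synchronized (executed) {
                executed.add(cmd); // placeholder for actually running the command
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    int pending() {
        return commandQueue.size();
    }

    List<String> executed() {
        synchronized (executed) {
            return new ArrayList<>(executed);
        }
    }
}
```

The unbounded queue is a simplification; a real implementation would 
presumably want to bound it or coalesce duplicate commands.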

On the NN, the fsn lock could be a problem, but in practice we've not had that 
problem even with over 5k nodes.  But I really like the approach of making the 
heartbeat stat updates fsn-lockless.  Collecting the commands w/o the lock 
(since it doubles as an operational state lock) isn't trivial or you would have 
done that.  What you could do after the stat update is use tryLock for a short 
time.  If you can't get the lock, oh well, this heartbeat response doesn't get 
any commands.
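
A rough sketch of the lockless + tryLock idea follows.  It is not the NN 
implementation: the class, its fields, and the string commands are hypothetical, 
and a plain ReentrantLock stands in for the fsn lock (which is really a 
read-write lock).  The only real API relied on is 
java.util.concurrent.locks.Lock#tryLock(long, TimeUnit).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: names and types are illustrative, not Hadoop APIs.
class HeartbeatTryLockSketch {
    // Stand-in for the namesystem (fsn) lock.
    private final ReentrantLock fsnLock = new ReentrantLock();
    // Liveness stat that is updated without taking the lock.
    private final AtomicLong lastHeartbeatMillis = new AtomicLong();
    // Commands queued for the DataNode; guarded by fsnLock.
    private final List<String> pendingCommands = new ArrayList<>();

    void queueCommand(String cmd) {
        fsnLock.lock();
        try {
            pendingCommands.add(cmd);
        } finally {
            fsnLock.unlock();
        }
    }

    // Records the heartbeat locklessly, then tries briefly to collect commands.
    List<String> handleHeartbeat(long nowMillis, long lockWaitMillis) {
        lastHeartbeatMillis.set(nowMillis); // lockless stat update comes first
        boolean locked = false;
        try {
            locked = fsnLock.tryLock(lockWaitMillis, TimeUnit.MILLISECONDS);
            if (!locked) {
                // Lock is busy: this response simply carries no commands.
                return Collections.emptyList();
            }
            List<String> cmds = new ArrayList<>(pendingCommands);
            pendingCommands.clear();
            return cmds;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Collections.emptyList();
        } finally {
            if (locked) {
                fsnLock.unlock();
            }
        }
    }

    long lastHeartbeatMillis() {
        return lastHeartbeatMillis.get();
    }
}
```

Commands skipped because the lock was busy stay queued and go out with a later 
heartbeat response, so liveness tracking never waits on the lock.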

I'm not sure we need yet another RPC server for this purpose.  NN heartbeat 
processing with a lockless + tryLock implementation would make it ideally 
suited for the existing client and/or service servers.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
