[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003219#comment-15003219
 ] 

Ming Ma commented on HDFS-9239:
-------------------------------

Sorry for jumping into the discussion late. While we haven't seen any recent 
issues caused by DNs incorrectly marked as dead, maybe this feature could 
mitigate the replication storm issue, where DNs incorrectly marked as dead 
cause even more replication?

* It seems the new RPC server is being introduced to work around a limitation 
of the existing RPC functionality, which only supports QoS based on user 
names. Imagine if the RPC server could provide differentiated service based on 
method names: then we could just add {{sendLifeline}} to the existing 
{{DatanodeProtocol}} and have the same RPC server process that method call at 
the highest priority. Adding method-based RPC QoS could also help other use 
cases, for example, if we want to prioritize the existing heartbeat over IBR.
* Regarding the DN contention scenario that blocks the DN from sending 
{{sendLifeline}} to the NN, we could skip all info such as storage reports. 
But if the DN is already in such a state, maybe not sending {{sendLifeline}} 
is what we want anyway.
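
To make the method-based QoS idea in the first bullet concrete, here is a 
minimal, standalone sketch. It is not Hadoop's RPC code: the class name 
MethodPriorityQueueSketch, the priority table and the default priority are 
all assumptions made for illustration. It only shows queued calls being 
served in an order determined by method name, with FIFO order preserved among 
calls to the same method.

```java
import java.util.Map;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: serves queued calls ordered by a per-method priority. */
public class MethodPriorityQueueSketch {
    // Hypothetical priority table; a lower value is served first.
    private static final Map<String, Integer> PRIORITY = Map.of(
        "sendLifeline", 0,              // proposed liveness call
        "sendHeartbeat", 1,             // existing heartbeat
        "blockReceivedAndDeleted", 2);  // incremental block report (IBR)
    private static final int DEFAULT_PRIORITY = 3;

    /** A queued call; seq keeps FIFO order within one priority level. */
    private static final class Call implements Comparable<Call> {
        final String method;
        final int priority;
        final long seq;

        Call(String method, int priority, long seq) {
            this.method = method;
            this.priority = priority;
            this.seq = seq;
        }

        @Override
        public int compareTo(Call other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Call> queue = new PriorityBlockingQueue<>();
    private final AtomicLong nextSeq = new AtomicLong();

    /** Enqueues a call, tagging it with the priority of its method name. */
    public void offer(String method) {
        int priority = PRIORITY.getOrDefault(method, DEFAULT_PRIORITY);
        queue.offer(new Call(method, priority, nextSeq.getAndIncrement()));
    }

    /** Returns the method name of the next call to serve, or null if empty. */
    public String poll() {
        Call call = queue.poll();
        return call == null ? null : call.method;
    }

    public static void main(String[] args) {
        MethodPriorityQueueSketch q = new MethodPriorityQueueSketch();
        q.offer("blockReceivedAndDeleted");
        q.offer("sendHeartbeat");
        q.offer("sendLifeline");
        // The lifeline call is served first although it arrived last.
        for (String m = q.poll(); m != null; m = q.poll()) {
            System.out.println(m);
        }
    }
}
```

A strict priority order like this can starve low-priority calls such as IBRs 
under sustained load, so a real scheduler would probably need weighting or 
aging on top of it.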

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
