[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959826#comment-14959826
 ] 

Daryn Sharp commented on HDFS-9239:
-----------------------------------

It seems like a good idea at first, but I don't think the proposal solves the 
stated issues:
* This <protocol> prevents the NameNode from spuriously marking healthy 
DataNodes as stale or dead.
* ... delayed DataNodes may be flagged as stale, and applications may 
erroneously choose to avoid accessing those nodes
* ... DataNodes may be flagged as dead. In extreme cases, this can cause a 
NameNode to schedule wasteful re-replication activity.

Let's say the NN can't service heartbeats quickly enough to avoid false 
staleness (stale defaults to 30s).  That means it definitely can't process 
IBRs (incremental block reports) either.  Would a lifeline that prevents the 
stale flag matter at this point?  At this level of 
congestion, nearly all of the nodes are going stale.  The staleness is probably 
the least of your worries.  If nodes are marked dead from inability to keep up 
with heartbeats (defaults to ~10min), the cluster itself is already.  Worrying 
about wasted replications is dubious because the NN can't issue replications if 
it can't process the heartbeats.
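
For reference, the two thresholds mentioned above can be sketched as follows. 
This is only an illustration: it assumes the stock defaults 
(dfs.heartbeat.interval = 3s, dfs.namenode.heartbeat.recheck-interval = 5min, 
dfs.namenode.stale.datanode.interval = 30s), and the function names are 
hypothetical, not HDFS APIs.

```python
# Assumed stock defaults; verify against the hdfs-default.xml of your release.
HEARTBEAT_INTERVAL_S = 3               # dfs.heartbeat.interval
RECHECK_INTERVAL_MS = 5 * 60 * 1000    # dfs.namenode.heartbeat.recheck-interval
STALE_INTERVAL_MS = 30 * 1000          # dfs.namenode.stale.datanode.interval


def dead_interval_ms(heartbeat_interval_s=HEARTBEAT_INTERVAL_S,
                     recheck_interval_ms=RECHECK_INTERVAL_MS):
    """Silence after which a DataNode is considered dead:
    2 * recheck interval + 10 * heartbeat interval."""
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


def classify(ms_since_last_heartbeat,
             stale_ms=STALE_INTERVAL_MS,
             dead_ms=None):
    """Hypothetical helper: label a node by how long it has been silent."""
    if dead_ms is None:
        dead_ms = dead_interval_ms()
    if ms_since_last_heartbeat > dead_ms:
        return "dead"
    if ms_since_last_heartbeat >= stale_ms:
        return "stale"
    return "live"


if __name__ == "__main__":
    print(dead_interval_ms() / 60000.0, "minutes until dead")
    for silence_ms in (10_000, 45_000, 700_000):
        print(silence_ms, classify(silence_ms))
```

With these defaults the dead threshold works out to 10.5 minutes, which is 
the "~10min" figure above, while the stale threshold is crossed after only 
ten missed 3-second heartbeats.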

That is not a heavy load scenario.  From personal experience, it sounds like 
the fallout of a stop-the-world GC on a 120GB+ heap.  The NN wakes up and the 
heartbeat monitor starts marking everything dead.  This sparks a replication 
storm, followed by an invalidation storm, which the NN recovers from... unless 
it goes into another full GC.  The lifeline might help slow the rise of 
false-dead nodes.  However, I recently patched the heartbeat monitor to detect 
long GCs and be very lenient before marking nodes dead.
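
The general idea of a pause-aware heartbeat monitor can be sketched roughly 
as below. This is not the actual NameNode patch: the class name, the slack 
threshold, and the policy of skipping one whole round are all assumptions 
made for illustration.

```python
class PauseAwareMonitor:
    """Hypothetical monitor that skips dead-node checks after its own stall."""

    def __init__(self, recheck_interval_s, dead_interval_s, pause_slack_s):
        self.recheck_interval_s = recheck_interval_s
        self.dead_interval_s = dead_interval_s
        self.pause_slack_s = pause_slack_s   # assumed tolerance for scheduling jitter
        self.last_check = None

    def nodes_to_mark_dead(self, now, last_heartbeat):
        """last_heartbeat maps node name -> time of its last heartbeat (seconds)."""
        gap_limit = self.recheck_interval_s + self.pause_slack_s
        stalled = (self.last_check is not None
                   and now - self.last_check > gap_limit)
        self.last_check = now
        if stalled:
            # The monitor itself was frozen (e.g. a long GC), so missing
            # heartbeats say nothing about the nodes. Skip this round and
            # let the nodes report in before judging them.
            return []
        return sorted(node for node, seen in last_heartbeat.items()
                      if now - seen > self.dead_interval_s)


if __name__ == "__main__":
    monitor = PauseAwareMonitor(recheck_interval_s=300,
                                dead_interval_s=630,
                                pause_slack_s=30)
    print(monitor.nodes_to_mark_dead(0, {"dn1": 0, "dn2": 0}))
    # Simulated 20-minute pause: the round is skipped, nothing is marked dead.
    print(monitor.nodes_to_mark_dead(1200, {"dn1": 0, "dn2": 0}))
    # Next normal round: dn1 has heartbeated again, dn2 is still silent.
    print(monitor.nodes_to_mark_dead(1500, {"dn1": 1490, "dn2": 0}))
```

Time is passed in explicitly so the behaviour can be exercised without 
waiting on a real clock.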

If I've misinterpreted anything, please describe the incident that prompted 
this approach so we can see if it would have helped.

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
