[
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959826#comment-14959826
]
Daryn Sharp commented on HDFS-9239:
-----------------------------------
It seems like a good idea at first, but I don't think the proposal solves the
stated issues:
* This <protocol> prevents the NameNode from spuriously marking healthy
DataNodes as stale or dead.
* ... delayed DataNodes may be flagged as stale, and applications may
erroneously choose to avoid accessing those nodes
* ... DataNodes may be flagged as dead. In extreme cases, this can cause a
NameNode to schedule wasteful rereplication activity.
Let's say the NN can't service heartbeats to avoid false-staleness (stale
defaults to 30s). That means it definitely can't process IBRs either. Would a
lifeline to prevent the stale flag matter at this point? At this level of
congestion, nearly all of the nodes are going stale. The staleness is probably
the least of your worries. If nodes are marked dead from inability to keep up
with heartbeats (defaults to ~10min), the cluster itself is already dead. Worrying
about wasted replications is dubious because the NN can't issue replications if
it can't process the heartbeats.
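For reference, a rough sketch of the arithmetic behind the 30s and ~10min
figures above. It assumes the stock defaults for dfs.heartbeat.interval (3s),
dfs.namenode.stale.datanode.interval (30s) and
dfs.namenode.heartbeat.recheck-interval (5min), and that the dead interval is
derived as 2 * recheck + 10 * heartbeat. The class and method names are
hypothetical; this models the thresholds and is not the NameNode code.

```java
// Sketch only: models the default liveness thresholds, not the NameNode code.
class LivenessThresholds {
    // Assumed defaults (see lead-in); all values in milliseconds.
    static final long HEARTBEAT_INTERVAL_MS = 3_000L;   // dfs.heartbeat.interval = 3s
    static final long STALE_INTERVAL_MS = 30_000L;      // dfs.namenode.stale.datanode.interval
    static final long RECHECK_INTERVAL_MS = 300_000L;   // dfs.namenode.heartbeat.recheck-interval

    // Assumed derivation of the dead interval: 2 * recheck + 10 * heartbeat.
    static long deadIntervalMs(long recheckMs, long heartbeatMs) {
        return 2 * recheckMs + 10 * heartbeatMs;
    }

    // Classify a DataNode by how long it has been silent.
    static String classify(long silentMs) {
        if (silentMs > deadIntervalMs(RECHECK_INTERVAL_MS, HEARTBEAT_INTERVAL_MS)) {
            return "DEAD";
        }
        if (silentMs >= STALE_INTERVAL_MS) {
            return "STALE";
        }
        return "LIVE";
    }

    public static void main(String[] args) {
        long dead = deadIntervalMs(RECHECK_INTERVAL_MS, HEARTBEAT_INTERVAL_MS);
        System.out.println("stale after ms: " + STALE_INTERVAL_MS);
        System.out.println("dead after ms: " + dead);
        System.out.println("missed heartbeats before stale: "
            + STALE_INTERVAL_MS / HEARTBEAT_INTERVAL_MS);
        System.out.println("missed heartbeats before dead: "
            + dead / HEARTBEAT_INTERVAL_MS);
        System.out.println("silent for 60s: " + classify(60_000L));
    }
}
```

Under these assumptions a node has to miss on the order of ten consecutive
heartbeats to go stale and a couple of hundred to be declared dead.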
That is not a heavy load scenario. From personal experience, it sounds like
the fallout of a 120GB+ heap stop-the-world GC. The NN wakes up and the
heartbeat monitor starts marking everything dead. This sparks a replication
storm, followed by an invalidation storm, which the NN recovers from... unless
it goes into another full GC. The lifeline might help slow the rise of
false-dead nodes. However, I recently patched the heartbeat monitor to detect
long GCs and be very lenient before marking nodes dead.
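To make the idea concrete, here is a minimal sketch of a pause-aware heartbeat
check. It is not the actual patch: the class name, the method, and the
"skip one round after a long gap" policy are all assumptions for illustration.
The monitor compares the gap between its own wakeups against what it expects;
if the gap is far too large, the process itself was paused (for example by a
full GC), so the missing heartbeats say nothing about the DataNodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the NameNode's heartbeat monitor.
class PauseAwareHeartbeatMonitor {
    private final long expireMs;          // silence after which a node counts as dead
    private final long maxExpectedGapMs;  // largest normal gap between two checks
    private long lastCheckMs;

    PauseAwareHeartbeatMonitor(long expireMs, long maxExpectedGapMs, long startMs) {
        this.expireMs = expireMs;
        this.maxExpectedGapMs = maxExpectedGapMs;
        this.lastCheckMs = startMs;
    }

    // Returns the nodes to declare dead at nowMs. If the monitor itself was
    // paused, it returns an empty list so nodes get a chance to heartbeat again.
    List<String> check(long nowMs, Map<String, Long> lastHeartbeatMs) {
        long gap = nowMs - lastCheckMs;
        lastCheckMs = nowMs;
        List<String> dead = new ArrayList<>();
        if (gap > maxExpectedGapMs) {
            // Long pause detected: skip this round instead of marking nodes dead.
            return dead;
        }
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > expireMs) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        PauseAwareHeartbeatMonitor monitor =
            new PauseAwareHeartbeatMonitor(630_000L, 30_000L, 0L);
        Map<String, Long> heartbeats = new HashMap<>();
        heartbeats.put("dn1", 0L);
        heartbeats.put("dn2", 0L);

        System.out.println("normal check: " + monitor.check(5_000L, heartbeats));
        // Simulated 15 minute pause: both nodes look expired, but the round is skipped.
        System.out.println("after pause: " + monitor.check(905_000L, heartbeats));
        // dn1 heartbeats again; dn2 stays silent and is reported on the next round.
        heartbeats.put("dn1", 906_000L);
        System.out.println("next check: " + monitor.check(910_000L, heartbeats));
    }
}
```

The simulated timeline in main uses explicit timestamps rather than a real
clock, which keeps the sketch deterministic. A real monitor would presumably
use a monotonic clock and might extend the grace period rather than skip a
single round.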
If I've misinterpreted anything, please describe the incident that prompted
this approach so we can see if it would have helped.
> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode
> liveness
> -----------------------------------------------------------------------------------
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline
> Protocol. This is an RPC protocol that is responsible for reporting liveness
> and basic health information about a DataNode to a NameNode. Compared to the
> existing heartbeat messages, it is lightweight and not prone to resource
> contention problems that can harm accurate tracking of DataNode liveness
> currently. The attached design document contains more details.