[
https://issues.apache.org/jira/browse/HDFS-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871385#comment-13871385
]
Daryn Sharp commented on HDFS-5773:
-----------------------------------
The issue was seen on 0.23 and is believed to affect 2.x as well, but that is
not confirmed. After a flow control issue described in HADOOP-10233, the NN's
heartbeat manager, waking up from lengthy GC pauses, marked all the DNs dead
over the course of an hour.
The nodes do not appear to have attempted re-registration between re-sending
blockReceivedDeleted messages. The replication monitor went crazy as the nodes
died off, possibly eliciting the blockReceivedDeleted messages (that were
rejected) from the "dead" nodes.
> NN may reject formerly dead DNs
> -------------------------------
>
> Key: HDFS-5773
> URL: https://issues.apache.org/jira/browse/HDFS-5773
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.0.0-alpha, 3.0.0, 0.23.10
> Reporter: Daryn Sharp
> Priority: Critical
>
> If the heartbeat monitor declares a node dead, it may never allow a DN to
> rejoin. The NN will generate messages like "Got blockReceivedDeleted message
> from unregistered or dead node".
> There appears to be a bug where the isAlive flag is not set to true when
> a formerly known DN attempts to rejoin.
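
The suspected failure mode in the description above can be illustrated with a
small toy model. This is only a sketch of the hypothesis, not HDFS code: the
class, its methods, and the reviveOnRejoin switch are all hypothetical names
invented for illustration, and the real registration path may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the suspected bug. All names here are hypothetical;
// nothing in this class is taken from the HDFS source.
public class DeadNodeRejoinSketch {
    static class NodeInfo {
        boolean isAlive = true;
        long lastHeartbeatMs;
        NodeInfo(long nowMs) { lastHeartbeatMs = nowMs; }
    }

    private final Map<String, NodeInfo> nodes = new HashMap<>();
    private final long expiryMs;
    // Assumption: false models the suspected bug, true models one possible fix.
    private final boolean reviveOnRejoin;

    public DeadNodeRejoinSketch(long expiryMs, boolean reviveOnRejoin) {
        this.expiryMs = expiryMs;
        this.reviveOnRejoin = reviveOnRejoin;
    }

    // Registration: a brand-new node starts alive; a formerly known node
    // is only revived when reviveOnRejoin is set.
    public void register(String id, long nowMs) {
        NodeInfo known = nodes.get(id);
        if (known == null) {
            nodes.put(id, new NodeInfo(nowMs));
            return;
        }
        known.lastHeartbeatMs = nowMs;
        if (reviveOnRejoin) {
            known.isAlive = true;
        }
    }

    // Heartbeat check: any node silent for longer than expiryMs is marked dead.
    public void heartbeatCheck(long nowMs) {
        for (NodeInfo n : nodes.values()) {
            if (nowMs - n.lastHeartbeatMs > expiryMs) {
                n.isAlive = false;
            }
        }
    }

    // Block messages are rejected for unregistered or dead nodes.
    public boolean acceptBlockReceivedDeleted(String id) {
        NodeInfo n = nodes.get(id);
        return n != null && n.isAlive;
    }

    public static void main(String[] args) {
        for (boolean revive : new boolean[] {false, true}) {
            DeadNodeRejoinSketch nn = new DeadNodeRejoinSketch(1000, revive);
            nn.register("dn1", 0);       // initial registration
            nn.heartbeatCheck(5000);     // long pause, node declared dead
            nn.register("dn1", 5001);    // node tries to rejoin
            System.out.println("reviveOnRejoin=" + revive
                + " accepted=" + nn.acceptBlockReceivedDeleted("dn1"));
        }
    }
}
```

With reviveOnRejoin=false the rejoining node stays flagged dead and its
messages keep being rejected, which matches the symptom reported here; with
reviveOnRejoin=true the same sequence is accepted.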
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)