[
https://issues.apache.org/jira/browse/HDFS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299223#comment-14299223
]
Chris Nauroth commented on HDFS-7714:
-------------------------------------
Here are more details on what I've observed. I saw that the main
{{BPServiceActor#run}} loop was active for one NameNode, but the actor for the
other NameNode had reported the fatal "Initialization failed" error from this
part of the code:
{code}
while (true) {
  // init stuff
  try {
    // setup storage
    connectToNNAndHandshake();
    break;
  } catch (IOException ioe) {
    // Initial handshake, storage recovery or registration failed
    runningState = RunningState.INIT_FAILED;
    if (shouldRetryInit()) {
      // Retry until all namenode's of BPOS failed initialization
      LOG.error("Initialization failed for " + this + " "
          + ioe.getLocalizedMessage());
      sleepAndLogInterrupts(5000, "initializing");
    } else {
      runningState = RunningState.FAILED;
      LOG.fatal("Initialization failed for " + this + ". Exiting. ", ioe);
      return;
    }
  }
}
{code}
The {{ioe}} was an {{EOFException}} while trying the {{registerDatanode}} RPC.
Lining up timestamps from NN and DN logs, I could see that the NN had restarted
at the same time, causing it to abandon this RPC connection, ultimately
triggering the {{EOFException}} on the DataNode side.
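For illustration only (plain {{java.net}} sockets rather than the Hadoop RPC
stack, with made-up class names), this is the class of failure a reader sees
when the peer drops a connection before sending its response:
{code}
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.ServerSocket;
import java.net.Socket;

// Standalone demo: the "server" accepts and immediately closes the
// connection, much like a restarting NameNode abandoning an in-flight
// request; the client then hits EOFException while waiting for a reply.
public class AbandonedConnectionDemo {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(0);
         Socket client = new Socket("localhost", server.getLocalPort())) {
      server.accept().close();
      try (DataInputStream in = new DataInputStream(client.getInputStream())) {
        in.readInt();   // expects a response that never arrives
      } catch (EOFException e) {
        System.out.println("connection closed by peer: " + e);
      }
    }
  }
}
{code}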
Most importantly, the fact that it was on the code path with the fatal-level
logging means that it would never reattempt registration with this NameNode.
{{shouldRetryInit()}} must have returned {{false}}. The implementation of
{{BPOfferService#shouldRetryInit}} only keeps retrying if the other NameNode
has already registered successfully or the offer service still reports itself
alive:
{code}
/*
 * Let the actor retry for initialization until all namenodes of cluster have
 * failed.
 */
boolean shouldRetryInit() {
  if (hasBlockPoolId()) {
    // One of the namenode registered successfully. lets continue retry for
    // other.
    return true;
  }
  return isAlive();
}
{code}
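To spell out the decision table, here is a minimal standalone restatement of
that logic (the method and parameters are simplified stand-ins, not the real
{{BPOfferService}}); the fatal exit we observed corresponds to the row where
neither condition holds:
{code}
public class ShouldRetryInitSketch {
  // Simplified restatement: retry init while some NameNode of this
  // nameservice has already registered (block pool ID known), or the
  // offer service still reports itself alive.
  static boolean shouldRetryInit(boolean hasBlockPoolId, boolean isAlive) {
    if (hasBlockPoolId) {
      return true;   // the other NameNode registered; keep retrying this one
    }
    return isAlive;
  }

  public static void main(String[] args) {
    System.out.println(shouldRetryInit(true,  true));   // true  -> retry
    System.out.println(shouldRetryInit(true,  false));  // true  -> retry
    System.out.println(shouldRetryInit(false, true));   // true  -> retry
    System.out.println(shouldRetryInit(false, false));  // false -> FATAL, actor exits
  }
}
{code}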
Tying that all together, this bug happens when the first attempted NameNode
registration fails but the second succeeds. The DataNode process remains
running, but with only one live {{BPServiceActor}}.
HDFS-2882 had a lot of discussion of DataNode startup failure scenarios. I
think the summary of that discussion is that the DataNode should in general
retry its NameNode registrations, but it should abort right away if there is
no possibility of registration ever succeeding (e.g. a misconfiguration or a
hardware failure). I think the change we need here is to keep retrying the
{{registerDatanode}} RPC when the failure is NameNode downtime or a transient
connectivity problem. Other failure reasons should still cause an abort.
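To make that concrete, here is a rough sketch of the retry policy I have in
mind; the exception classification and the {{registerCall}} hook are
illustrative placeholders, not the actual {{BPServiceActor}} code:
{code}
import java.io.EOFException;
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Sketch only: keep retrying registration while the failure looks like
// NameNode downtime or transient connectivity trouble; let anything else
// propagate so the existing abort path still applies.
public class RegistrationRetrySketch {
  static void registerWithRetry(Callable<Void> registerCall) throws Exception {
    while (true) {
      try {
        registerCall.call();          // stands in for the registerDatanode RPC
        return;                       // registered successfully
      } catch (EOFException | ConnectException
          | NoRouteToHostException | SocketTimeoutException e) {
        // NameNode restarting, unreachable, or slow: treat as transient.
        System.err.println("Transient registration failure, will retry: " + e);
        Thread.sleep(5000);
      }
      // Any other IOException (version mismatch, disallowed DataNode, bad
      // config) propagates to the caller and still causes an abort.
    }
  }
}
{code}
The exact set of exceptions treated as transient is the debatable part; the
sketch only names the distinction we'd need to draw.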
> Simultaneous restart of HA NameNodes and DataNode can cause DataNode to
> register successfully with only one NameNode.
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-7714
> URL: https://issues.apache.org/jira/browse/HDFS-7714
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0
> Reporter: Chris Nauroth
>
> In an HA deployment, DataNodes must register with both NameNodes and send
> periodic heartbeats and block reports to both. However, if NameNodes and
> DataNodes are restarted simultaneously, then this can trigger a race
> condition in registration. The end result is that the {{BPServiceActor}} for
> one NameNode terminates, but the {{BPServiceActor}} for the other NameNode
> remains alive. The DataNode process is then in a "half-alive" state where it
> only heartbeats and sends block reports to one of the NameNodes. This could
> cause a loss of storage capacity after an HA failover. The DataNode process
> would have to be restarted to resolve this.