[
https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724932#comment-14724932
]
Yi Liu commented on HDFS-8995:
------------------------------
Yes. If re-registration fails, the datanode gets an
{{UnregisteredNodeException}} from the NN on subsequent incremental block
reports and heartbeats, which causes the BP-xxx service to shut down. The
exception shows up in the log.
{quote}
The fix is not saving the registration until the NN updates it
{quote}
Agree.
My question: is the change in {{BPOfferService#registrationSucceeded}} and
{{DataNode#bpRegistrationSucceeded}} necessary? A re-registration failure
throws an exception, so only a successful registration reaches that logic
and updates the variables if they are not null. That said, I think it is also
OK to update them on every registration or re-registration.
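To make the bookkeeping point concrete, here is a minimal sketch of "do not save the registration until the NN updates it". It is not the actual patch: {{Registration}}, {{NameNodeStub}}, {{ActorSketch}} and {{ReRegistrationDemo}} are hypothetical stand-ins, and the real datanode classes are more involved.

```java
import java.io.IOException;
import java.util.Objects;

/** Hypothetical stand-in for the registration a datanode holds. */
final class Registration {
    final String datanodeUuid;
    Registration(String datanodeUuid) { this.datanodeUuid = datanodeUuid; }
}

/** Hypothetical stand-in for the NN-side register call. */
interface NameNodeStub {
    // Returns the registration as updated by the NN, or throws if registration fails.
    Registration register(Registration proposed) throws Exception;
}

/** Simplified actor: the saved registration changes only after the NN accepts it. */
final class ActorSketch {
    private Registration saved; // null until the first successful registration

    Registration saved() { return saved; }

    void register(NameNodeStub nn, Registration proposed) throws Exception {
        // Keep the proposal in a local variable. If the call throws,
        // 'saved' still holds the last registration the NN confirmed.
        Registration updated = nn.register(proposed);
        // Reached only on success, so updating unconditionally is safe
        // for both first registration and re-registration.
        saved = Objects.requireNonNull(updated);
    }
}

final class ReRegistrationDemo {
    public static void main(String[] args) throws Exception {
        ActorSketch actor = new ActorSketch();
        actor.register(p -> new Registration(p.datanodeUuid + "-confirmed"),
                       new Registration("dn1"));
        try {
            // Simulated re-registration failure.
            actor.register(p -> { throw new IOException("NN unreachable"); },
                           new Registration("dn1-retry"));
        } catch (IOException e) {
            // The previously confirmed registration is still the saved one.
            System.out.println("saved after failure: " + actor.saved().datanodeUuid);
        }
    }
}
```

In this sketch a failed re-registration leaves the previously confirmed registration in place, so later heartbeats would not be sent with a registration the NN has never seen.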
So +1 pending Jenkins.
> Flaw in registration bookkeeping can make DN die on reconnect
> -------------------------------------------------------------
>
> Key: HDFS-8995
> URL: https://issues.apache.org/jira/browse/HDFS-8995
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
> Attachments: HDFS-8995.patch
>
>
> Normally, datanodes re-register with the namenode when it has been unreachable
> for longer than the heartbeat expiration and then becomes reachable again.
> Datanodes keep retrying the last RPC call, such as an incremental block report
> or heartbeat, and when the call finally gets through, the namenode tells them
> to re-register.
> We have observed that some datanodes stay dead in such scenarios. Further
> investigation has revealed that those were told to shut down by the namenode.