[jira] [Updated] (HDFS-8995) Flaw in registration bookkeeping can make DN die on reconnect

Kihwal Lee (JIRA) Mon, 31 Aug 2015 07:36:35 -0700

     [ https://issues.apache.org/jira/browse/HDFS-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Kihwal Lee updated HDFS-8995:
-----------------------------
    Description: 
Normally, datanodes re-register with the namenode when it has been unreachable
for longer than the heartbeat expiration interval and then becomes reachable
again. Each datanode keeps retrying its last RPC call, such as an incremental
block report or a heartbeat, and when the call finally gets through, the
namenode tells the datanode to re-register.

We have observed that some datanodes stay dead in such scenarios. Further
investigation has revealed that those datanodes were told to shut down by the
namenode.

  was:
Normally, datanodes re-register with the namenode when it has been unreachable
for longer than the heartbeat expiration interval and then becomes reachable
again. Each datanode keeps retrying its last RPC call, such as an incremental
block report or a heartbeat, and when the call finally gets through, the
namenode tells the datanode to re-register.

We have observed some datanodes stay dead in such scenarios. Further
investigation has revealed that those datanodes were told to shut down by the
namenode.


> Flaw in registration bookkeeping can make DN die on reconnect
> -------------------------------------------------------------
>
>                 Key: HDFS-8995
>                 URL: https://issues.apache.org/jira/browse/HDFS-8995
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> Normally, datanodes re-register with the namenode when it has been
> unreachable for longer than the heartbeat expiration interval and then
> becomes reachable again. Each datanode keeps retrying its last RPC call,
> such as an incremental block report or a heartbeat, and when the call
> finally gets through, the namenode tells the datanode to re-register.
>
> We have observed that some datanodes stay dead in such scenarios. Further
> investigation has revealed that those datanodes were told to shut down by
> the namenode.
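
For readers unfamiliar with the normal reconnect path described above, here is
a minimal sketch of it as a toy model. Everything in it (ToyNameNode,
ToyDataNode, the Command enum, the 1000 ms expiry) is hypothetical and chosen
for illustration; it is not Hadoop's code or API, and it does not attempt to
reproduce the shutdown behavior reported in this issue, whose cause is not
stated here.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical classes, not Hadoop's implementation.
class ReRegistrationSketch {
    // Reply the toy namenode can attach to a heartbeat response.
    enum Command { NONE, REGISTER }

    static class ToyNameNode {
        private final long expiryMillis;
        // Datanode id -> time of the last accepted heartbeat or registration.
        private final Map<String, Long> lastHeartbeat = new HashMap<>();

        ToyNameNode(long expiryMillis) {
            this.expiryMillis = expiryMillis;
        }

        void register(String dnId, long now) {
            lastHeartbeat.put(dnId, now);
        }

        // An unknown or expired datanode is dropped and asked to re-register.
        Command heartbeat(String dnId, long now) {
            Long last = lastHeartbeat.get(dnId);
            if (last == null || now - last > expiryMillis) {
                lastHeartbeat.remove(dnId);
                return Command.REGISTER;
            }
            lastHeartbeat.put(dnId, now);
            return Command.NONE;
        }

        boolean isRegistered(String dnId) {
            return lastHeartbeat.containsKey(dnId);
        }
    }

    static class ToyDataNode {
        final String id;
        int registrations = 0;

        ToyDataNode(String id) {
            this.id = id;
        }

        void register(ToyNameNode nn, long now) {
            nn.register(id, now);
            registrations++;
        }

        // Send a heartbeat and obey a REGISTER reply by registering again.
        void sendHeartbeat(ToyNameNode nn, long now) {
            if (nn.heartbeat(id, now) == Command.REGISTER) {
                register(nn, now);
            }
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode(1000);
        ToyDataNode dn = new ToyDataNode("dn-1");
        dn.register(nn, 0);
        dn.sendHeartbeat(nn, 500);   // within the expiry window
        dn.sendHeartbeat(nn, 5000);  // first call to get through after an outage
        System.out.println("registrations = " + dn.registrations);
    }
}
```

In this sketch the heartbeat at t=5000 arrives after the expiry window, so the
toy namenode answers REGISTER and the toy datanode registers a second time.
The report above concerns cases where the real namenode instead tells the
datanode to shut down.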



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
