[ 
https://issues.apache.org/jira/browse/HDFS-9650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frode Halvorsen updated HDFS-9650:
----------------------------------
    Description: 
Hadoop 2.7.1, two namenodes in HA, 14 datanodes.

Enough CPU, disk and RAM.

I just discovered that some datanodes must have become corrupted somehow.

When restarting a 'defective' datanode (it works without failure except when restarting), the active namenode suddenly logs a lot of "Redundant addStoredBlock request received" messages, and finally the failover-controller takes the namenode down and fails over to the other node. That node also starts logging the same messages, and as soon as the first node is back online, the failover-controller again kills the active node and fails over.
The node that is now active was started after the datanode, no longer logs "Redundant addStoredBlock request received", and a restart of the second namenode works fine.
If I restart the datanode again, the process repeats itself.

The problem is the logging of "Redundant addStoredBlock request received", and why it happens.
The failover-controller acts the same way as it did on 2.5/2.6 when we had a lot of 'block does not belong to any replica' messages: the namenode is too busy to respond to heartbeats, and is taken down...
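
As a point of reference (not a fix for the redundant-block storm itself), the ZKFC health-monitor settings that decide how long a busy namenode may stay unresponsive before being declared failed can be inspected like this; the values in the comments are the Hadoop defaults, and any tuning of them is only an assumption:

  # Sketch: inspect the ZKFC health-monitor timeouts on a namenode host
  hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms      # default 45000 (45 s)
  hdfs getconf -confKey ha.health-monitor.check-interval.ms   # default 1000 (1 s)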

To resolve this, I have to take the datanode down, delete all data from it, and start it up again. The cluster then re-replicates the missing blocks, and the failing datanode works fine again...
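
For reference, the workaround looks roughly like this; this is only a sketch, and /data/hdfs/dn stands in for whatever dfs.datanode.data.dir actually points to on the failing host:

  # On the failing datanode (sketch; /data/hdfs/dn is an assumed data directory)
  hadoop-daemon.sh stop datanode
  rm -rf /data/hdfs/dn/current
  hadoop-daemon.sh start datanode

  # From any node: watch the cluster re-replicate the now-missing blocks
  hdfs dfsadmin -report
  hdfs fsck /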


> Problem is logging of "Redundant addStoredBlock request received"
> -----------------------------------------------------------------
>
>                 Key: HDFS-9650
>                 URL: https://issues.apache.org/jira/browse/HDFS-9650
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Frode Halvorsen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
