[
https://issues.apache.org/jira/browse/HADOOP-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583204#action_12583204
]
Hairong Kuang commented on HADOOP-3050:
---------------------------------------
After examining the log, it looks like the following scenario occurred:
1. blk_167544198419718831 was replicated to datanode 1, datanode 2, and
datanode 3;
2. Datanode 1 lost contact with the namenode, and datanode 2 was scheduled to
be decommissioned.
3. Datanode 1 re-registered with the namenode, but its block report came in
before its network location was resolved, so the block report was dropped.
4. Because the namenode did not know that datanode 1 had
blk_167544198419718831, it scheduled replication of the block to datanode 1
and datanode 4.
5. The replication to datanode 1 failed because that node already had the block.
6. No additional block report was received before the end of the log, so the
block replication kept failing.
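The loop in steps 4-6 can be sketched as a toy model (hypothetical class and method names, not actual Hadoop code): once the dropped block report leaves the namenode's block map stale, target selection keeps choosing a node that already holds the replica, the copy fails, and nothing ever corrects the map.

```java
import java.util.*;

// Simplified model of the HADOOP-3050 scenario. All names here are
// illustrative; the real namenode logic lives in FSNamesystem and is
// far more involved.
class BlockMapModel {
    // What the namenode *believes*: block id -> nodes holding a replica.
    private final Map<String, Set<String>> blockMap = new HashMap<>();
    // Ground truth: nodes that actually hold the replica on disk.
    private final Map<String, Set<String>> actualReplicas = new HashMap<>();

    // A block report that is processed normally updates both views.
    void reportReplica(String block, String node) {
        blockMap.computeIfAbsent(block, b -> new HashSet<>()).add(node);
        actualReplicas.computeIfAbsent(block, b -> new HashSet<>()).add(node);
    }

    // Step 3: the re-registration block report is dropped, so the
    // namenode forgets the node holds the block; the on-disk replica
    // (ground truth) is untouched.
    void dropReportFrom(String block, String node) {
        blockMap.getOrDefault(block, Collections.emptySet()).remove(node);
    }

    // Step 4: pick any live node the namenode does not believe holds
    // the block.
    String chooseReplicationTarget(String block, List<String> liveNodes) {
        Set<String> known = blockMap.getOrDefault(block, Collections.emptySet());
        for (String node : liveNodes) {
            if (!known.contains(node)) return node;
        }
        return null;
    }

    // Step 5: copying to a node that already has the replica fails, and
    // the stale block map is never corrected, so the next pass picks
    // the same target again (step 6).
    boolean replicate(String block, String target) {
        Set<String> actual =
            actualReplicas.computeIfAbsent(block, b -> new HashSet<>());
        if (actual.contains(target)) {
            return false; // block already exists on target: replication fails
        }
        actual.add(target);
        blockMap.computeIfAbsent(block, b -> new HashSet<>()).add(target);
        return true;
    }
}

public class Hadoop3050Demo {
    public static void main(String[] args) {
        BlockMapModel nn = new BlockMapModel();
        String blk = "blk_167544198419718831";
        nn.reportReplica(blk, "dn1");
        nn.reportReplica(blk, "dn3");
        nn.dropReportFrom(blk, "dn1"); // dn1's report is dropped

        List<String> live = Arrays.asList("dn1", "dn4");
        for (int i = 0; i < 3; i++) { // same target, same failure, forever
            String target = nn.chooseReplicationTarget(blk, live);
            boolean ok = nn.replicate(blk, target);
            System.out.println("attempt " + i + ": target=" + target
                    + " success=" + ok);
        }
    }
}
```

The model suggests why only a fresh, successfully processed block report (or treating a "replica already exists" failure as an implicit report) can break the cycle.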
> Cluster falls into an infinite loop trying to replicate a block to a target
> that already has this replica.
> -----------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3050
> URL: https://issues.apache.org/jira/browse/HADOOP-3050
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.17.0
> Reporter: Konstantin Shvachko
> Assignee: Hairong Kuang
> Priority: Blocker
> Attachments: FailedTestDecommission.log
>
>
> This happened during a test run by Hudson. So fortunately we have all logs
> present.
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1987/console
> Search for TestDecommission, and look for block blk_167544198419718831,
> which is being replicated to node 127.0.0.1:65168 over and over again.
> The issue needs to be investigated. I am making it a blocker until it is.