[
https://issues.apache.org/jira/browse/HDFS-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-4799:
------------------------------
Attachment: hdfs-4799-unittest.txt
Here's a hacky unit test that reproduces the issue (at least some of the
time). The eventual unit test will be cleaned up from this one; I just wanted
to post it in case anyone wants to look at the issue.
> Corrupt replica can be prematurely removed from corruptReplicas map
> -------------------------------------------------------------------
>
> Key: HDFS-4799
> URL: https://issues.apache.org/jira/browse/HDFS-4799
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.4-alpha
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
> Attachments: hdfs-4799-unittest.txt
>
>
> We saw the following sequence of events in a cluster result in losing the
> most recent genstamp of a block:
> - client is writing to a pipeline of 3 datanodes
> - nodes in the pipeline failed over some period of time, leaving 3
> old-genstamp replicas on the original three nodes and recruiting 3 new
> replicas with a later genstamp
> -- so, we have 6 total replicas in the cluster: 3 with old genstamps on
> downed nodes, and 3 with the latest genstamp
> - the cluster reboots, and the nodes with old genstamps block report first.
> Those replicas are correctly added to the corruptReplicas map since they
> have a too-old genstamp
> - the nodes with the new genstamp then block report. When the last of them
> block reports, chooseExcessReplicates is called and incorrectly decides to
> remove the three good replicas, leaving only the old-genstamp replicas.
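The failure sequence above can be sketched as a simplified standalone model. This is not actual HDFS code: the class, the Replica record, and the chooseExcess method are hypothetical stand-ins for the real chooseExcessReplicates logic, and only illustrate how trimming to the target replication count without excluding already-corrupt replicas can delete the good (latest-genstamp) copies.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (NOT real HDFS code) of the HDFS-4799 excess-replica bug.
// State after the block reports: 3 stale-genstamp replicas (already marked
// corrupt) plus 3 live replicas with the latest genstamp; target replication 3.
public class ExcessReplicaSketch {
    record Replica(String node, long genStamp) {}

    // Flawed chooser: trims the replica list down to the target count
    // without first excluding replicas already in the corrupt map, so it
    // may pick the good replicas as "excess". Here it removes from the end
    // of the list, modeling an ordering where the most recent
    // block-reporters (the good replicas) are trimmed first.
    static List<Replica> chooseExcess(List<Replica> all, int target) {
        List<Replica> excess = new ArrayList<>();
        for (int i = all.size() - 1; all.size() - excess.size() > target; i--) {
            excess.add(all.get(i));
        }
        return excess;
    }

    public static void main(String[] args) {
        long oldGS = 1001, newGS = 1002; // hypothetical genstamp values
        List<Replica> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) replicas.add(new Replica("oldNode" + i, oldGS));
        for (int i = 0; i < 3; i++) replicas.add(new Replica("newNode" + i, newGS));

        List<Replica> removed = chooseExcess(replicas, 3);
        replicas.removeAll(removed);

        // Every surviving replica carries the stale genstamp: the latest
        // generation of the block is gone, matching the reported data loss.
        boolean onlyStaleLeft =
                replicas.stream().allMatch(r -> r.genStamp() == oldGS);
        System.out.println("only stale replicas left: " + onlyStaleLeft);
    }
}
```

A fix along the lines the report implies would filter out replicas already present in the corruptReplicas map before selecting excess replicas, so only the stale copies are eligible for removal.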
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira