[ https://issues.apache.org/jira/browse/HDFS-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Foley updated HDFS-1855:
-----------------------------
Attachment: TestDatanodeBlockScanner_bug_v1.patch
In method blockCorruptionRecoveryPolicy(), 5 nodes are created, 3 of which hold
replicas of a certain block. Two of those replicas, on the nodes at index [0]
and [1], are deliberately corrupted. The test then attempts to restart those
two nodes so that the corruption will be detected.
The loop that is intended to restart both datanodes starts with [0]. But when
[0] is restarted, it is removed from the MiniDFSCluster's ArrayList and
re-added at the end, so the former [1] moves to [0]. The loop then restarts the
new [1], which is the former [2] and doesn't contain a corrupt replica. As a
result, the corrupt replica on the former [1] is never detected.
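The index shift can be reproduced outside of HDFS with a plain list. This is
only a sketch: the class RestartOrderDemo, the restart() helper, and the node
names dn0..dn4 are made up for illustration and are not MiniDFSCluster code.
They assume nothing beyond the remove-and-re-add-at-end behavior described
above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestartOrderDemo {
    // Hypothetical stand-in for a datanode restart: the node at index i is
    // removed from the list and re-added at the end.
    static String restart(List<String> nodes, int i) {
        String node = nodes.remove(i);
        nodes.add(node);
        return node;
    }

    // Restart indices 0..count-1 in ascending order (the failing pattern).
    // Returns the names of the nodes that were actually restarted.
    static List<String> restartAscending(List<String> nodes, int count) {
        List<String> restarted = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        // dn0 and dn1 play the role of the nodes holding corrupt replicas.
        List<String> nodes = new ArrayList<>(
                Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        System.out.println(restartAscending(nodes, 2)); // [dn0, dn2]
        System.out.println(nodes); // [dn1, dn3, dn4, dn0, dn2]
    }
}
```

In this sketch dn1, standing in for the second corrupt node, is never
restarted; dn2 is restarted in its place.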
In resolving the corruption, one of two errors can happen, each with
probability 50%. Since the namenode thinks it still has two good replicas, it
may pick the undetected corrupt replica as the source for re-replication. That
will cause a checksum error at the receiving node.
Alternatively, it may pick the one valid replica as the source, replicate it,
and delete the bad replica from the original [0]. However, since it doesn't
know that the replica on the former [1] is corrupt, it never issues the delete
request for that replica. This causes the test case to time out while waiting
for corrupt replica deletion.
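This problem is resolved by looping from high [1] to low [0], so that
restarting a node never shifts the index of a node still waiting to be
restarted. The same approach is used in certain MiniDFSCluster methods.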
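A minimal sketch of the descending loop, using a plain list rather than the
real cluster. The class RestartDescendingDemo, the restart() helper, and the
node names are hypothetical; this is not the content of the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestartDescendingDemo {
    // Hypothetical stand-in for a datanode restart: the node at index i is
    // removed from the list and re-added at the end.
    static String restart(List<String> nodes, int i) {
        String node = nodes.remove(i);
        nodes.add(node);
        return node;
    }

    // Restart indices count-1..0 in descending order. Removing index i only
    // shifts entries above i, so the lower indices still to be visited keep
    // pointing at the intended nodes.
    static List<String> restartDescending(List<String> nodes, int count) {
        List<String> restarted = new ArrayList<>();
        for (int i = count - 1; i >= 0; i--) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        // dn0 and dn1 play the role of the nodes holding corrupt replicas.
        List<String> nodes = new ArrayList<>(
                Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        System.out.println(restartDescending(nodes, 2)); // [dn1, dn0]
        System.out.println(nodes); // [dn2, dn3, dn4, dn1, dn0]
    }
}
```

Both stand-ins for the corrupt nodes, dn1 and dn0, are restarted, and the
nodes that were not meant to be restarted keep their relative order.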
> TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in
> two different ways
> -----------------------------------------------------------------------------------------------
>
> Key: HDFS-1855
> URL: https://issues.apache.org/jira/browse/HDFS-1855
> Project: Hadoop HDFS
> Issue Type: Test
> Components: test
> Affects Versions: 0.22.0
> Reporter: Matt Foley
> Assignee: Matt Foley
> Fix For: 0.22.0, 0.23.0
>
> Attachments: TestDatanodeBlockScanner_bug_v1.patch
>
>
> The second part of test case
> TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy(), "corrupt
> replica recovery for two corrupt replicas", always fails, half the time with
> a checksum error upon block replication, and half the time by timing out upon
> failure to delete the second corrupt replica.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira