[ 
https://issues.apache.org/jira/browse/HDFS-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Foley updated HDFS-1855:
-----------------------------

    Attachment: TestDatanodeBlockScanner_bug_v1.patch

In method blockCorruptionRecoveryPolicy(), 5 datanodes are created, 3 of which 
hold replicas of a certain block.  Two of those replicas, on the nodes at index 
[0] and [1], are deliberately corrupted.  The test then attempts to restart 
those two nodes so that the corruption will be detected.

The loop that is intended to restart both datanodes starts with [0].  But when 
[0] is restarted, it is removed from the MiniCluster's ArrayList and re-added at 
the end.  As a result, the former [1] moves to [0], and the loop then restarts 
the new [1], which was the former [2] and doesn't contain a corrupt replica.  
The corrupt replica on the former [1] therefore never gets detected.
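
The index shift can be reproduced without a cluster.  The following is only a 
minimal simulation, not MiniDFSCluster code: a plain ArrayList stands in for 
the MiniCluster's datanode list, and the names RestartOrderDemo, restart() and 
restartAscending() are hypothetical.  It assumes nothing beyond what is 
described above, namely that a restart removes the node from the list and 
re-adds it at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestartOrderDemo {
    // Hypothetical stand-in for a datanode restart: the node at index i is
    // removed from the list and re-added at the end.
    public static String restart(List<String> nodes, int i) {
        String dn = nodes.remove(i);
        nodes.add(dn);
        return dn;
    }

    // Restart indexes 0 and 1 in ascending order and record which nodes
    // were actually restarted.
    public static List<String> restartAscending() {
        List<String> nodes =
            new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        List<String> restarted = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        // dn1 (the second corrupt replica) is skipped; dn2 is restarted instead.
        System.out.println(restartAscending()); // prints [dn0, dn2]
    }
}
```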

In resolving the corruption, one of two errors then occurs, each about half the 
time.  In the first, since the namenode still believes it has two good 
replicas, it may pick the undetected corrupt replica as the source for 
re-replication.  That will cause a checksum error at the receiving node.

Alternatively, it may pick the one valid replica as the source, replicate it, 
and delete the bad replica from the original [0].  However, since it doesn't 
know that the replica on the former [1] is corrupt, it never issues a delete 
request for that replica.  This causes the test case to time out while waiting 
for corrupt replica deletion.

This problem is resolved by looping from high [1] to low [0], as is done in 
certain MiniDFSCluster methods, so that restarting a node never shifts the 
index of a node that is still waiting to be restarted.
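
A minimal sketch of the high-to-low loop order, under the same assumption that 
a restart removes the node from the list and re-adds it at the end.  This is an 
illustration only, not the attached patch: the ArrayList stands in for the 
MiniCluster's datanode list, and RestartDescendingDemo, restart() and 
restartDescending() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RestartDescendingDemo {
    // Hypothetical stand-in for a datanode restart: the node at index i is
    // removed from the list and re-added at the end.
    public static String restart(List<String> nodes, int i) {
        String dn = nodes.remove(i);
        nodes.add(dn);
        return dn;
    }

    // Restart indexes 1 and 0 in descending order.  Removing index 1 first
    // only shifts nodes at higher indexes, so index 0 still refers to the
    // original first node when its turn comes.
    public static List<String> restartDescending() {
        List<String> nodes =
            new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
        List<String> restarted = new ArrayList<>();
        for (int i = 1; i >= 0; i--) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        // Both nodes holding corrupt replicas are restarted.
        System.out.println(restartDescending()); // prints [dn1, dn0]
    }
}
```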

> TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in 
> two different ways
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1855
>                 URL: https://issues.apache.org/jira/browse/HDFS-1855
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: test
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>             Fix For: 0.22.0, 0.23.0
>
>         Attachments: TestDatanodeBlockScanner_bug_v1.patch
>
>
> The second part of test case 
> TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy(), "corrupt 
> replica recovery for two corrupt replicas", always fails, half the time with 
> a checksum error upon block replication, and half the time by timing out upon 
> failure to delete the second corrupt replica.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
