[ 
https://issues.apache.org/jira/browse/HDFS-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454235#comment-13454235
 ] 

Todd Lipcon commented on HDFS-3931:
-----------------------------------

I can think of a few ways to fix this:

1) Consider it a test error, and have the test set the pending replication 
timeout to a much lower value (like 5 secs). This is under the assumption that 
this is a rare phenomenon, and in practice we are OK waiting 10 minutes to make 
a new replica when this happens.

2) Add a field to the DN heartbeat which reports back a failed replication for 
a given block. The NN would use this to decrement the pendingReplication count, 
which would cause a new replication attempt to be made if the block was still 
under-replicated.


Option 1 is clearly less risky, since it's a test-only change, but option 2 is 
probably "righter" and has the advantage of reducing the under-replication 
window in some rare cases.
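
To make the difference between the two options concrete, here is a rough toy 
model of the bookkeeping involved. It is a sketch only, not the actual NameNode 
code: the class PendingReplicationSketch and all of its method names are 
hypothetical, and the failed-replication report stands in for the heartbeat 
field proposed in option 2, which does not exist today.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of pending-replication tracking. Hypothetical; not Hadoop code. */
public class PendingReplicationSketch {

    /** In-flight replication count and last scheduling time for one block. */
    static final class Pending {
        int inFlight;
        long scheduledAtMs;
    }

    private final Map<String, Pending> pending = new HashMap<>();
    private final long timeoutMs;

    public PendingReplicationSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** The NN asks a DN to copy the block. */
    public void scheduleReplication(String blockId, long nowMs) {
        Pending p = pending.computeIfAbsent(blockId, k -> new Pending());
        p.inFlight++;
        p.scheduledAtMs = nowMs;
    }

    /** Option 2 (assumed mechanism): a DN reports that a replication failed. */
    public void reportFailedReplication(String blockId) {
        Pending p = pending.get(blockId);
        if (p == null) {
            return; // nothing was pending for this block
        }
        p.inFlight--;
        if (p.inFlight <= 0) {
            pending.remove(blockId); // block becomes eligible again right away
        }
    }

    /** Option 1 relies on this path: entries older than the timeout are dropped. */
    public void expireTimedOut(long nowMs) {
        pending.values().removeIf(p -> nowMs - p.scheduledAtMs >= timeoutMs);
    }

    /** In this model a new replication is scheduled only if none is pending. */
    public boolean canScheduleAgain(String blockId) {
        return !pending.containsKey(blockId);
    }

    public static void main(String[] args) {
        // Option 1: lower the timeout (5 seconds here) and wait for it to expire.
        PendingReplicationSketch shortTimeout = new PendingReplicationSketch(5_000L);
        shortTimeout.scheduleReplication("blk_1", 0L);
        shortTimeout.expireTimedOut(1_000L);
        System.out.println("after 1s:  " + shortTimeout.canScheduleAgain("blk_1"));
        shortTimeout.expireTimedOut(5_000L);
        System.out.println("after 5s:  " + shortTimeout.canScheduleAgain("blk_1"));

        // Option 2: keep a long timeout, but let the DN report the failure.
        PendingReplicationSketch withReport = new PendingReplicationSketch(600_000L);
        withReport.scheduleReplication("blk_1", 0L);
        withReport.reportFailedReplication("blk_1");
        System.out.println("after report: " + withReport.canScheduleAgain("blk_1"));
    }
}
```

In this model, option 1 only shortens how long the block stays stuck, while 
option 2 removes the wait entirely once the failure is reported.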
                
> TestDatanodeBlockScanner#testBlockCorruptionPolicy2 is broken
> -------------------------------------------------------------
>
>                 Key: HDFS-3931
>                 URL: https://issues.apache.org/jira/browse/HDFS-3931
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Andy Isaacson
>
> Per Andy's comment on HDFS-3902:
> TestDatanodeBlockScanner still fails about 1/5 runs in 
> testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also 
> uncovered by HDFS-3828.
> The failure scenario for this one is a bit more tricky. I think I've captured 
> the scenario below:
> - The test corrupts 2/3 replicas.
> - client reports a bad block.
> - NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> - DN notices the incoming replica is corrupt and reports it as a bad block, 
> but does not inform the NN that re-replication failed.
> - NN keeps the block on pendingReplications.
> - BP scanner wakes up on both DNs with corrupt blocks, both report 
> corruption. NN reports both as duplicates, one from the client and one from 
> the DN report above.
> - Since block is on pendingReplications, NN does not schedule another 
> replication.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira