[ 
https://issues.apache.org/jira/browse/HDFS-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459247#comment-13459247
 ] 

Andy Isaacson commented on HDFS-3931:
-------------------------------------

bq. Looks like the patch needs to be rebased on trunk. hdfs3931-1.txt looks 
cleaner so maybe -2.txt is the wrong patch? 

I uploaded a {{git diff -b}} by accident, sorry.  I've uploaded -3.txt which 
should be right.

bq. I don't think we should double waitReplication for everyone.

What's the downside to increasing waitReplication's iteration count?  We don't 
have failures frequently hitting this path, and there aren't any users that 
catch the TimeoutException.  I certainly could add a parameter, but it seems 
like just increasing the timeout from 20 seconds to 40 seconds is fine, since 
it only affects tests that were going to fail.
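
For reference, here is a minimal sketch of what a parameterized wait could look like next to a doubled default. It is not the actual waitReplication code: the class name {{PollingWait}}, the method {{waitFor}}, and the one-second sleep interval are assumptions made for illustration. Only the 20 to 40 change and the TimeoutException come from the discussion above.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; not part of Hadoop. */
final class PollingWait {
    // Assumed defaults: 40 attempts at 1 second each gives roughly 40 seconds.
    static final int DEFAULT_ATTEMPTS = 40;
    static final long DEFAULT_SLEEP_MILLIS = 1000L;

    /** Polls the condition up to maxAttempts times, sleeping between checks. */
    static void waitFor(BooleanSupplier condition, int maxAttempts, long sleepMillis)
            throws TimeoutException, InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return; // condition met, no need to keep waiting
            }
            Thread.sleep(sleepMillis);
        }
        throw new TimeoutException(
            "condition not met after " + maxAttempts + " attempts");
    }

    /** Existing callers keep the short signature and get the default budget. */
    static void waitFor(BooleanSupplier condition)
            throws TimeoutException, InterruptedException {
        waitFor(condition, DEFAULT_ATTEMPTS, DEFAULT_SLEEP_MILLIS);
    }
}
```

With this shape, raising the default changes only how long a failing wait takes to give up. A wait whose condition is met returns on the first successful check either way.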

bq. Ditto for waitCorruptReplicas; adding a version that lets you pass an 
attempts value would be cleaner.

The patch doesn't change the behavior of waitCorruptReplicas, and it's not 
obvious to me how I could hoist the loop in testBlockCorruptionPolicy2 down 
into waitCorruptReplicas.
                
> TestDatanodeBlockScanner#testBlockCorruptionPolicy2 is broken
> -------------------------------------------------------------
>
>                 Key: HDFS-3931
>                 URL: https://issues.apache.org/jira/browse/HDFS-3931
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Andy Isaacson
>         Attachments: hdfs3931-1.txt, hdfs3931-2.txt, hdfs3931-3.txt, 
> hdfs3931.txt
>
>
> Per Andy's comment on HDFS-3902:
> TestDatanodeBlockScanner still fails in about 1 of 5 runs of 
> testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also 
> uncovered by HDFS-3828.
> The failure scenario for this one is a bit more tricky. I think I've captured 
> the scenario below:
> - The test corrupts 2/3 replicas.
> - The client reports a bad block.
> - NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> - DN notices the incoming replica is corrupt and reports it as a bad block, 
> but does not inform the NN that re-replication failed.
> - NN keeps the block on pendingReplications.
> - BP scanner wakes up on both DNs with corrupt blocks, both report 
> corruption. NN reports both as duplicates, one from the client and one from 
> the DN report above.
> - Since the block is on pendingReplications, the NN does not schedule another 
> replication.
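
The last steps of the scenario above can be illustrated with a toy model. This is a sketch only and not HDFS code: the class {{PendingReplicationModel}} and its methods are hypothetical, and the model reduces the NN to a single set of pending block IDs. It shows why repeated corruption reports schedule nothing new while the block stays pending, and what changes if the failed re-replication is cleared.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy stand-in for the NN's pending-replication bookkeeping; hypothetical. */
final class PendingReplicationModel {
    private final Set<String> pendingReplications = new HashSet<>();
    private int scheduled = 0;

    /**
     * Handles a corrupt-replica report. Returns true if a new replication
     * was scheduled, false if the block was already pending.
     */
    boolean reportCorrupt(String blockId) {
        if (!pendingReplications.add(blockId)) {
            return false; // already pending, so nothing new is scheduled
        }
        scheduled++;
        return true;
    }

    /** The notification that is missing in the scenario: replication failed. */
    void replicationFailed(String blockId) {
        pendingReplications.remove(blockId);
    }

    int scheduledCount() {
        return scheduled;
    }
}
```

In this model, a second and third {{reportCorrupt}} call for the same block return false until {{replicationFailed}} is called, which mirrors the stall described in the list.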
