[ https://issues.apache.org/jira/browse/HDFS-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459247#comment-13459247 ]
Andy Isaacson commented on HDFS-3931:
-------------------------------------
bq. Looks like the patch needs to be rebased on trunk. hdfs3931-1.txt looks
cleaner so maybe -2.txt is the wrong patch?
I uploaded a {{git diff -b}} by accident, sorry. I've uploaded -3.txt which
should be right.
bq. I don't think we should double waitReplication for everyone.
What's the downside to increasing waitReplication's iteration count? We don't
have failures frequently hitting this path, and there aren't any users that
catch the TimeoutException. I certainly could add a parameter, but it seems
like just increasing the timeout from 20 seconds to 40 seconds is fine, since
it only affects tests that were going to fail.
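
For illustration, a wait helper with a configurable attempt count could look roughly like the sketch below. This is not the DFSTestUtil code: the class name {{PollingWait}}, the {{waitFor}} signature, and the one-second sleep interval are all hypothetical, chosen so that 40 attempts corresponds to the 40-second figure above.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; a sketch, not the actual Hadoop test utility. */
final class PollingWait {
  // Assumed defaults: 40 attempts, 1 second apart (roughly 40 seconds total).
  static final int DEFAULT_ATTEMPTS = 40;
  static final long DEFAULT_SLEEP_MS = 1000;

  private PollingWait() {}

  /** Existing callers keep a no-argument-style call and get the default limit. */
  static void waitFor(BooleanSupplier condition)
      throws TimeoutException, InterruptedException {
    waitFor(condition, DEFAULT_ATTEMPTS, DEFAULT_SLEEP_MS);
  }

  /** Callers that need a different limit pass their own attempt count. */
  static void waitFor(BooleanSupplier condition, int attempts, long sleepMs)
      throws TimeoutException, InterruptedException {
    for (int i = 0; i < attempts; i++) {
      if (condition.getAsBoolean()) {
        return; // condition met, stop polling
      }
      Thread.sleep(sleepMs);
    }
    throw new TimeoutException(
        "Condition not met after " + attempts + " attempts");
  }

  public static void main(String[] args) throws Exception {
    int[] polls = {0};
    // Condition becomes true on the third poll.
    waitFor(() -> ++polls[0] >= 3, 5, 10);
    System.out.println("satisfied after " + polls[0] + " polls");

    try {
      // Condition never becomes true, so this times out.
      waitFor(() -> false, 2, 10);
    } catch (TimeoutException e) {
      System.out.println("timed out: " + e.getMessage());
    }
  }
}
```

With an overload like this, raising the default only lengthens the wait for callers whose condition never becomes true; callers whose condition is met early return on the first successful poll either way.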
bq. Ditto for waitCorruptReplicas, adding a version that lets you pass an
attempts value would be cleaner.
The patch doesn't change the behavior of waitCorruptReplicas, and it's not
obvious to me how I could hoist the loop in testBlockCorruptionPolicy2 down
into waitCorruptReplicas.
> TestDatanodeBlockScanner#testBlockCorruptionPolicy2 is broken
> -------------------------------------------------------------
>
> Key: HDFS-3931
> URL: https://issues.apache.org/jira/browse/HDFS-3931
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Affects Versions: 2.0.0-alpha
> Reporter: Eli Collins
> Assignee: Andy Isaacson
> Attachments: hdfs3931-1.txt, hdfs3931-2.txt, hdfs3931-3.txt,
> hdfs3931.txt
>
>
> Per Andy's comment on HDFS-3902:
> TestDatanodeBlockScanner still fails about 1/5 runs in
> testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also
> uncovered by HDFS-3828.
> The failure scenario for this one is a bit more tricky. I think I've captured
> the scenario below:
> - The test corrupts 2/3 replicas.
> - Client reports a bad block.
> - NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> - DN notices the incoming replica is corrupt and reports it as a bad block,
> but does not inform the NN that re-replication failed.
> - NN keeps the block on pendingReplications.
> - BP scanner wakes up on both DNs with corrupt blocks, both report
> corruption. NN reports both as duplicates, one from the client and one from
> the DN report above.
> - Since the block is on pendingReplications, NN does not schedule another
> replication.