[
https://issues.apache.org/jira/browse/HDFS-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460038#comment-13460038
]
Andy Isaacson commented on HDFS-3931:
-------------------------------------
bq. I'm surprised that there's a scenario where 20 is insufficient but it still
eventually does get replicated correctly.
The scenario is:
- NN requests replication and puts the block on the pending replication queue.
- The requested replication source is corrupt, so the replication fails.
- Eventually the NN times out the pending replication queue entry and tries
again. By then the block scanner has run and marked the replica as corrupt, so
eventually we are guaranteed the replication source is not corrupt.
Without cranking DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY down, we fail
about 1 in 2 tries (the NN picks the remaining corrupt replica as the
replication source half the time). After speeding up the pending replication
timeouts, each retry independently fails with probability 1/2, so the overall
failure probability decays as 1/2*1/2*1/2... until the waitReplication loop
bails. If the blockScanner has a chance to run and mark the replicas corrupt,
and the replication queue then has time to run after the blockScanner finishes,
we succeed reliably.
Without increasing the timeout from 20 seconds to 40 seconds, the test fails
about 1 in 3 runs.
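For concreteness, here's a minimal sketch of cranking that timeout down in a
test (assuming the usual MiniDFSCluster harness; the 5-second value is
illustrative, not what the patch uses):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class PendingTimeoutSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Lower the pending-replication timeout so the NN retries a failed
    // re-replication quickly instead of sitting on the stale entry.
    conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_PENDING_TIMEOUT_SEC_KEY, 5);
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(3).build();
    try {
      cluster.waitActive();
      // ... corrupt replicas and exercise re-replication here ...
    } finally {
      cluster.shutdown();
    }
  }
}
{code}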
bq. Why is restarting the datanodes ITERATIONS times necessary?
As the test is currently structured, we're polling to notice that the corrupt
replica count has increased. With the fixed blockscanner from HDFS-3828, the
test sometimes misses the timing window for noticing the corrupt replicas: the
blockscanner occasionally (about 1 in 20 runs in my tests, IIRC) wins a race,
and the NN deletes the corrupt replica before the test can poll it. When that
happens, we have to go back and retry the whole "corrupt two replicas, restart
their DNs, poll" loop.
So it's not "restarting the datanodes" that we are retrying here. The thing
we're retrying is "corrupt the replicas, restart the DNs, watch the corrupt
count to verify it changed".
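Roughly, that retry unit has the shape sketched below. Every name here is
hypothetical; this is just the structure of the loop, not the actual test code:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class CorruptionRetrySketch {
  static final int ITERATIONS = 3;

  // Retries the whole corrupt/restart/poll sequence; returns true once an
  // attempt actually observes the corrupt-replica count increase.
  static boolean corruptRestartAndPoll(Runnable corruptReplicasAndRestartDNs,
                                       BooleanSupplier corruptCountIncreased)
      throws InterruptedException {
    for (int i = 0; i < ITERATIONS; i++) {
      corruptReplicasAndRestartDNs.run();   // corrupt 2 replicas, restart DNs
      long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(40);
      while (System.nanoTime() < deadline) {
        if (corruptCountIncreased.getAsBoolean()) {
          return true;                      // saw the corrupt count change
        }
        Thread.sleep(100);                  // poll interval
      }
      // Lost the race: the blockscanner reported first and the NN already
      // deleted the corrupt replicas, so redo the whole sequence.
    }
    return false;
  }
}
{code}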
> TestDatanodeBlockScanner#testBlockCorruptionPolicy2 is broken
> -------------------------------------------------------------
>
> Key: HDFS-3931
> URL: https://issues.apache.org/jira/browse/HDFS-3931
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Affects Versions: 2.0.0-alpha
> Reporter: Eli Collins
> Assignee: Andy Isaacson
> Attachments: hdfs3931-1.txt, hdfs3931-2.txt, hdfs3931-3.txt,
> hdfs3931.txt
>
>
> Per Andy's comment on HDFS-3902:
> TestDatanodeBlockScanner still fails about 1/5 runs in
> testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also
> uncovered by HDFS-3828.
> The failure scenario for this one is a bit more tricky. I think I've captured
> the scenario below:
> - The test corrupts 2/3 replicas.
> - The client reports a bad block.
> - The NN asks a DN to re-replicate, and randomly picks the other corrupt
> replica as the source.
> - The DN notices the incoming replica is corrupt and reports it as a bad
> block, but does not inform the NN that re-replication failed.
> - The NN keeps the block on pendingReplications.
> - The BP scanner wakes up on both DNs with corrupt blocks, and both report
> the corruption. The NN treats both reports as duplicates, one of the client
> report and one of the DN report above.
> - Since the block is on pendingReplications, the NN does not schedule another
> replication (sketched below).
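To make that last step concrete, here's an illustrative-only distillation of
the scheduling dead-end (hypothetical names, not the real BlockManager code):
while a block has an in-flight entry in pendingReplications it looks satisfied,
so nothing new is scheduled until the pending entry times out.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class PendingReplicationSketch {
  // block id -> replications currently believed to be in flight
  private final Map<Long, Integer> pending = new HashMap<>();

  void markReplicationInFlight(long blockId) {
    pending.merge(blockId, 1, Integer::sum);
  }

  boolean shouldScheduleReplication(long blockId, int liveReplicas,
                                    int requiredReplication) {
    int inFlight = pending.getOrDefault(blockId, 0);
    // A failed-but-unreported replication still counts as in flight, so
    // the block looks satisfied and no new work is scheduled until the
    // pending-timeout sweep clears the entry.
    return liveReplicas + inFlight < requiredReplication;
  }
}
{code}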