[ https://issues.apache.org/jira/browse/HDFS-3931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459300#comment-13459300 ]

Eli Collins commented on HDFS-3931:
-----------------------------------

bq. What's the downside to increasing waitReplication's iteration count? We 
don't have failures frequently hitting this path, and there aren't any users 
that catch the TimeoutException. I certainly could add a parameter, but it 
seems like just increasing the timeout from 20 seconds to 40 seconds is fine, 
since it only affects tests that were going to fail.

I think the answer I'm looking for is "the test is timing out in 
waitReplication because sometimes it takes more than 20 iterations". It's not 
clear to me whether the timeout occurs because 20 iterations is genuinely 
insufficient, or because there's a race in which the block will *never* be 
sufficiently replicated, in which case no number of iterations would help. If 
it's the latter then bumping the count doesn't help, so I presume it's the 
former? I'm surprised there's a scenario where 20 iterations is insufficient 
but the block still eventually does get replicated correctly.
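
For reference, the pattern in question is a bounded polling loop. A minimal 
sketch of a waitReplication-style helper with the iteration count lifted into 
a parameter might look like the following (illustrative only, not the actual 
DFSTestUtil.waitReplication code):

{code:java}
import java.util.concurrent.TimeoutException;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch, not the actual DFSTestUtil.waitReplication:
 * poll once per second until every block of 'file' reports
 * 'replFactor' locations, giving up after maxIterations.
 */
public class ReplicationWait {
  static void waitReplication(FileSystem fs, Path file, short replFactor,
      int maxIterations) throws Exception {
    for (int i = 0; i < maxIterations; i++) {
      boolean fullyReplicated = true;
      BlockLocation[] locs = fs.getFileBlockLocations(
          fs.getFileStatus(file), 0, Long.MAX_VALUE);
      for (BlockLocation loc : locs) {
        if (loc.getHosts().length != replFactor) {
          fullyReplicated = false;
          break;
        }
      }
      if (fullyReplicated) {
        return;
      }
      Thread.sleep(1000);  // one second per iteration, so 20 iterations ~= 20s
    }
    throw new TimeoutException("replication of " + file
        + " never reached " + replFactor);
  }
}
{code}

Parameterizing it this way would let the flaky test opt into a longer wait 
without changing the timeout every other caller sees.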

bq. The patch doesn't change the behavior of waitCorruptReplicas,

I was suggesting that waitCorruptReplicas could just wait longer rather than 
loop in the test. Why is restarting the datanodes ITERATIONS times necessary? 
Is the DN restart needed to kick the block scanner? If so, would it make the 
test more reliable to trigger the DN block scan directly rather than 
indirectly via a restart?
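
To make that concrete: if the restarts are only there to kick the scanner, 
the loop could be replaced with something like the sketch below, where 
runBlockScanner(...) is a hypothetical stand-in for whatever hook the branch 
exposes for forcing a scan (I'm not asserting that exact API exists):

{code:java}
// Hedged sketch. runBlockScanner(...) is a hypothetical test hook for
// forcing a scan of 'block'; waitCorruptReplicas is the existing test
// helper (arguments abbreviated here).
for (DataNode dn : cluster.getDataNodes()) {
  runBlockScanner(dn, block);  // hypothetical: scan this block on dn now
}
// One wait with a generous timeout, instead of ITERATIONS rounds of
// restart-then-wait.
waitCorruptReplicas(fs, namesystem, file, block, expectedCorruptReplicas);
{code}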
                
> TestDatanodeBlockScanner#testBlockCorruptionRecoveryPolicy2 is broken
> ---------------------------------------------------------------------
>
>                 Key: HDFS-3931
>                 URL: https://issues.apache.org/jira/browse/HDFS-3931
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Andy Isaacson
>         Attachments: hdfs3931-1.txt, hdfs3931-2.txt, hdfs3931-3.txt, 
> hdfs3931.txt
>
>
> Per Andy's comment on HDFS-3902:
> TestDatanodeBlockScanner still fails about 1/5 runs in 
> testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also 
> uncovered by HDFS-3828.
> The failure scenario for this one is a bit trickier. I think I've captured 
> it below:
> - The test corrupts 2/3 replicas.
> - The client reports a bad block.
> - The NN asks a DN to re-replicate, and randomly picks the other corrupt 
> replica as the source.
> - The DN notices the incoming replica is corrupt and reports it as a bad 
> block, but does not inform the NN that re-replication failed.
> - The NN keeps the block on pendingReplications.
> - The block pool scanner wakes up on both DNs with corrupt replicas and 
> both report the corruption, but the NN treats both reports as duplicates: 
> one of the client's earlier report and one of the DN report above.
> - Since the block is on pendingReplications, the NN does not schedule 
> another replication (sketched below).
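
For concreteness, here is an outline of the shape of the failing test under 
that scenario (simplified and hedged; corruptReplica follows the test's 
helper naming, but this is not the actual TestDatanodeBlockScanner code):

{code:java}
// Hedged outline, not the actual test. With replication 3 and 2 of 3
// replicas corrupted, the race above can strand the block on
// pendingReplications.
Path file = new Path("/tmp/testBlockCorruptionRecoveryPolicy2");
DFSTestUtil.createFile(fs, file, 1024L, (short) 3, 0L);
ExtendedBlock block = DFSTestUtil.getFirstBlock(fs, file);

corruptReplica(block, 0);        // corrupt the replica on DN 0
corruptReplica(block, 1);        // corrupt the replica on DN 1
DFSTestUtil.readFile(fs, file);  // client hits a bad replica and reports it

// The NN schedules re-replication. If it picks the other corrupt
// replica as the source, the receiving DN detects the corruption, the
// block stays on pendingReplications, and the wait below times out.
DFSTestUtil.waitReplication(fs, file, (short) 3);
{code}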

