[ 
https://issues.apache.org/jira/browse/HDFS-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453682#comment-13453682
 ] 

Andy Isaacson commented on HDFS-3902:
-------------------------------------

bq. TestDatanodeBlockScanner still fails about 1/5 runs in 
testBlockCorruptionRecoveryPolicy2. That's due to a separate test issue also 
uncovered by HDFS-3828.

The failure scenario for this one is a bit trickier.  I think I've captured 
it below:
# The test corrupts 2 of the 3 replicas.
# The client reports a bad block.
# NN asks a DN to re-replicate, and randomly picks the other corrupt replica 
as the source.
# DN notices the incoming replica is corrupt and reports it as a bad block, but 
does not inform the NN that re-replication failed.
# NN keeps the block on pendingReplications.
# BP scanner wakes up on both DNs with corrupt blocks, and both report 
corruption.  NN reports both as duplicates: one duplicates the client's report 
and the other duplicates the DN report above.
# Since the block is still on pendingReplications, NN does not schedule another 
replication.
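
The sequence above can be sketched as a toy model. This is not Hadoop code: 
{{ToyNameNode}}, {{run_scenario}}, {{replication_failed}} and the DN names are 
hypothetical, and the "pending" flag only stands in for the block sitting on 
pendingReplications. The {{notify_failure}} switch models the notification that 
step 4 says is missing; it is there for contrast, not as a proposed fix.

```python
class ToyNameNode:
    """Hypothetical, heavily simplified model of NN bookkeeping for one block."""

    def __init__(self, replicas):
        self.replicas = dict(replicas)  # DN name -> True if the replica is intact
        self.corrupt = set()            # DNs whose replica has been reported bad
        self.pending = False            # stand-in for pendingReplications
        self.scheduled = []             # sources chosen for re-replication so far

    def report_bad_block(self, dn):
        """Record a corruption report; True if new, False if a duplicate."""
        is_new = dn not in self.corrupt
        self.corrupt.add(dn)
        return is_new

    def schedule_replication(self, source):
        """Schedule a copy from `source` unless one is already pending."""
        if self.pending:
            return False
        self.pending = True
        self.scheduled.append(source)
        return True

    def replication_failed(self):
        """Assumed notification that, per step 4, the DN never sends."""
        self.pending = False


def run_scenario(notify_failure=False):
    # Step 1: two of the three replicas are corrupt.
    nn = ToyNameNode({"dn1": False, "dn2": False, "dn3": True})
    log = []
    log.append(nn.report_bad_block("dn1"))      # step 2: client report (new)
    nn.schedule_replication("dn2")              # step 3: unlucky source choice
    if not nn.replicas["dn2"]:                  # step 4: target sees corruption
        log.append(nn.report_bad_block("dn2"))  #   reported as a bad block (new)
        if notify_failure:
            nn.replication_failed()
    # Step 5: without the notification, the pending flag stays set.
    log.append(nn.report_bad_block("dn1"))      # step 6: scanner reports,
    log.append(nn.report_bad_block("dn2"))      #   both are duplicates
    rescheduled = nn.schedule_replication("dn3")  # step 7
    return log, rescheduled


if __name__ == "__main__":
    print(run_scenario())
    print(run_scenario(notify_failure=True))
```

In this model the second element of the result is False for the default run 
(the block stays stuck) and True only when the failure notification is sent.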
                
> TestDatanodeBlockScanner is flaky, broke entirely after HDFS-3828
> -----------------------------------------------------------------
>
>                 Key: HDFS-3902
>                 URL: https://issues.apache.org/jira/browse/HDFS-3902
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>         Attachments: hdfs3902.txt
>
>
> Since HDFS-3828 fixed the block scanner to not repeatedly rescan small 
> blockpools, TestDatanodeBlockScanner times out after 13 minutes in 
> {{waitReplication}}.
