[ https://issues.apache.org/jira/browse/HDFS-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464924#comment-13464924 ]

Colin Patrick McCabe commented on HDFS-3982:
--------------------------------------------

In step #4, why doesn't the DN receiving the corrupt replica simply receive it 
and flag it as corrupt?  The block would then come off pendingReplications, 
and once the NN notices that 2 of its 3 replicas are corrupt it would schedule 
another replication.  That would not need any special flags or fields.

If it takes us a long time to re-replicate blocks that have only 1 non-corrupt 
replica, that seems like a separate problem that we should fix rather than hack 
around.  Unless I'm missing something here.
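
To make the suggestion above concrete, here is a minimal sketch of the 
bookkeeping it implies.  Every name in it (CorruptReceiptSketch, 
replicaReceived, needsReplication, the block and DN ids) is hypothetical and 
is not an HDFS class or method; it only assumes that the NN counts pending 
transfers toward the replication target, which is what steps #5 to #7 of the 
description suggest.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of NN bookkeeping. Not real HDFS code.
class CorruptReceiptSketch {
    static final int TARGET_REPLICAS = 3;

    // Replication transfers the NN is still waiting on, per block.
    private final Map<String, Integer> pending = new HashMap<>();
    // DNs holding any replica of a block, good or corrupt.
    private final Map<String, Set<String>> holders = new HashMap<>();
    // DNs whose replica the NN knows to be corrupt.
    private final Map<String, Set<String>> corrupt = new HashMap<>();

    void addReplica(String block, String dn) {
        holders.computeIfAbsent(block, k -> new HashSet<>()).add(dn);
    }

    void markCorrupt(String block, String dn) {
        corrupt.computeIfAbsent(block, k -> new HashSet<>()).add(dn);
    }

    void scheduleReplication(String block) {
        pending.merge(block, 1, Integer::sum);
    }

    // The target DN reports that the transfer finished. A corrupt replica
    // still ends the transfer, so the pending count drops either way.
    void replicaReceived(String block, String dn, boolean isCorrupt) {
        pending.computeIfPresent(block, (k, n) -> n > 1 ? n - 1 : null);
        addReplica(block, dn);
        if (isCorrupt) {
            markCorrupt(block, dn);
        }
    }

    int pendingCount(String block) {
        return pending.getOrDefault(block, 0);
    }

    int liveReplicas(String block) {
        Set<String> bad = corrupt.getOrDefault(block, Collections.emptySet());
        int live = 0;
        for (String dn : holders.getOrDefault(block, Collections.emptySet())) {
            if (!bad.contains(dn)) {
                live++;
            }
        }
        return live;
    }

    // Pending transfers count toward the target, so a transfer that never
    // completes hides the shortfall.
    boolean needsReplication(String block) {
        return liveReplicas(block) + pendingCount(block) < TARGET_REPLICAS;
    }

    public static void main(String[] args) {
        CorruptReceiptSketch nn = new CorruptReceiptSketch();
        nn.addReplica("blk_1", "dn1");
        nn.addReplica("blk_1", "dn2");
        nn.addReplica("blk_1", "dn3");
        nn.markCorrupt("blk_1", "dn2");   // reported by the client
        nn.scheduleReplication("blk_1");  // source happens to be corrupt dn3
        System.out.println("stuck while pending: " + !nn.needsReplication("blk_1"));
        nn.replicaReceived("blk_1", "dn4", true);
        System.out.println("rescheduled after receipt: " + nn.needsReplication("blk_1"));
    }
}
```

In this model the block stays stuck only while the transfer is counted as 
pending; once the corrupt replica is reported as received, the ordinary 
under-replication check fires again without any new field.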
                
> report failed replications in DN heartbeat
> ------------------------------------------
>
>                 Key: HDFS-3982
>                 URL: https://issues.apache.org/jira/browse/HDFS-3982
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 2.0.2-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>
> From HDFS-3931:
> {quote}
> # The test corrupts 2/3 replicas.
> # client reports a bad block.
> # NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> # DN notices the incoming replica is corrupt and reports it as a bad block, 
> but does not inform the NN that re-replication failed.
> # NN keeps the block on pendingReplications.
> # BP scanner wakes up on both DNs with corrupt blocks, and both report 
> corruption. NN reports both as duplicates, one from the client and one from 
> the DN report above.
> # since block is on pendingReplications, NN does not schedule another 
> replication.
> Todd wrote:
> I can think of a few ways to fix this:
> ...
>  2) Add a field to the DN heartbeat which reports back a failed replication 
> for a given block. The NN would use this to decrement the pendingReplication 
> count, which would cause a new replication attempt to be made if it was still 
> under-replicated.
> This jira tracks implementing the DN heartbeat replication failure report.
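
The heartbeat proposal quoted above could look roughly like the following.  
This is only a sketch under stated assumptions: the Heartbeat type, its 
failedReplications field, and handleHeartbeat are hypothetical names invented 
for illustration and do not reflect the actual DN heartbeat protocol or any 
patch attached to this issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an NN handling failed-replication reports.
class HeartbeatFailureSketch {

    // Hypothetical heartbeat payload; only the proposed extra field is modelled.
    static final class Heartbeat {
        final String datanode;
        final List<String> failedReplications;

        Heartbeat(String datanode, List<String> failedReplications) {
            this.datanode = datanode;
            this.failedReplications = failedReplications;
        }
    }

    // Outstanding replication transfers per block id.
    private final Map<String, Integer> pendingReplications = new HashMap<>();

    void scheduleReplication(String blockId) {
        pendingReplications.merge(blockId, 1, Integer::sum);
    }

    int pendingCount(String blockId) {
        return pendingReplications.getOrDefault(blockId, 0);
    }

    // Decrements the pending count for each reported failure and returns the
    // blocks left with no outstanding transfer. The caller would then check
    // whether each one is still under-replicated and schedule a new attempt.
    List<String> handleHeartbeat(Heartbeat hb) {
        List<String> retryCandidates = new ArrayList<>();
        for (String blockId : hb.failedReplications) {
            Integer n = pendingReplications.get(blockId);
            if (n == null) {
                continue; // not pending, nothing to decrement
            }
            if (n > 1) {
                pendingReplications.put(blockId, n - 1);
            } else {
                pendingReplications.remove(blockId);
                retryCandidates.add(blockId);
            }
        }
        return retryCandidates;
    }

    public static void main(String[] args) {
        HeartbeatFailureSketch nn = new HeartbeatFailureSketch();
        nn.scheduleReplication("blk_1");
        List<String> failed = new ArrayList<>();
        failed.add("blk_1");
        List<String> retry = nn.handleHeartbeat(new Heartbeat("dn4", failed));
        System.out.println("retry candidates: " + retry);
    }
}
```

Returning the retry candidates rather than rescheduling inside the handler 
keeps the sketch neutral about how the NN decides that a block is still 
under-replicated.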

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
