[ 
https://issues.apache.org/jira/browse/HDFS-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464910#comment-13464910
 ] 

Uma Maheswara Rao G commented on HDFS-3982:
-------------------------------------------

{quote}
since block is on pendingReplications, NN does not schedule another replication.
{quote}
After the pending replication times out, does the NN not schedule the block 
for replication again? The NN moves blocks from pendingReplications back to 
neededReplications when they time out. On successful replication, the block 
is removed from pendingReplications anyway.
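
To make the timeout path concrete, here is a minimal sketch of it. It is a 
simplified model, not the actual NameNode code: the class, its methods, and 
the string block IDs are hypothetical, and time is passed in explicitly 
instead of being read from a monitor thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model of the two queues discussed above.
// None of these names are the real NameNode classes or methods.
class PendingReplicationSketch {
    private final long timeoutMs;
    // Block ID -> time (ms) at which its replication was scheduled.
    private final Map<String, Long> pendingReplications = new HashMap<>();
    // Blocks still waiting to be scheduled for replication.
    private final Queue<String> neededReplications = new ArrayDeque<>();

    PendingReplicationSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void addNeeded(String blockId) {
        if (!neededReplications.contains(blockId)) {
            neededReplications.add(blockId);
        }
    }

    // The NN asks a DN to replicate: the block moves from needed to pending.
    void schedule(String blockId, long nowMs) {
        neededReplications.remove(blockId);
        pendingReplications.put(blockId, nowMs);
    }

    // A successful replication simply drops the pending entry.
    void replicationSucceeded(String blockId) {
        pendingReplications.remove(blockId);
    }

    // Timed-out entries go back to neededReplications so that the block
    // can be scheduled again.
    void checkTimeouts(long nowMs) {
        Iterator<Map.Entry<String, Long>> it =
            pendingReplications.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() >= timeoutMs) {
                String blockId = entry.getKey();
                it.remove();
                addNeeded(blockId);
            }
        }
    }

    boolean isPending(String blockId) {
        return pendingReplications.containsKey(blockId);
    }

    boolean isNeeded(String blockId) {
        return neededReplications.contains(blockId);
    }

    public static void main(String[] args) {
        PendingReplicationSketch nn = new PendingReplicationSketch(1000L);
        nn.addNeeded("blk_1");
        nn.schedule("blk_1", 0L);   // replication requested at t=0
        nn.checkTimeouts(1000L);    // the DN never reported success
        System.out.println("pending=" + nn.isPending("blk_1")
            + " needed=" + nn.isNeeded("blk_1"));
    }
}
```

In this model a failed transfer that is never reported only delays the retry 
by the timeout; it does not block it forever.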

Currently, if the replication factor and the cluster size are the same, the 
NN won't re-replicate the block, because there is no new node to copy it to: 
the existing nodes already hold a replica (corrupt or good). Until there are 
enough good replicas, it won't invalidate any block. See HDFS-3586 for 
details. I am not sure whether this issue is the same as or similar to that 
one; please take a look and confirm.
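
The deadlock described above can be sketched as follows. This is only an 
illustration of the reasoning, not the real block placement or invalidation 
code: the class, the method names, and the node names are hypothetical, and 
reading "enough good replicas" as "at least the replication factor" is an 
assumption on my side.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; not the actual NameNode target-selection logic.
class ReplicationTargetSketch {
    // A node qualifies as a target only if it holds no replica of the
    // block at all, neither a good one nor a corrupt one.
    static Optional<String> chooseTarget(Set<String> clusterNodes,
                                         Set<String> goodHolders,
                                         Set<String> corruptHolders) {
        for (String node : clusterNodes) {
            if (!goodHolders.contains(node) && !corruptHolders.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }

    // Assumption: corrupt replicas are invalidated only once the number of
    // good replicas has reached the replication factor.
    static boolean canInvalidateCorrupt(int goodReplicas, int replicationFactor) {
        return goodReplicas >= replicationFactor;
    }

    public static void main(String[] args) {
        // Three-node cluster, replication factor 3, two of three replicas corrupt.
        Set<String> cluster = new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
        Set<String> good = new HashSet<>(Arrays.asList("dn1"));
        Set<String> corrupt = new HashSet<>(Arrays.asList("dn2", "dn3"));

        System.out.println("target present: "
            + chooseTarget(cluster, good, corrupt).isPresent());
        System.out.println("can invalidate corrupt: "
            + canInvalidateCorrupt(good.size(), 3));
    }
}
```

With every node already holding a replica, no target is available, and with 
only one good replica the corrupt ones are not invalidated either, so 
neither step can make progress. Adding a fourth node to the cluster set 
gives chooseTarget something to return.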
                
> report failed replications in DN heartbeat
> ------------------------------------------
>
>                 Key: HDFS-3982
>                 URL: https://issues.apache.org/jira/browse/HDFS-3982
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 2.0.2-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>
> From HDFS-3931:
> {quote}
> # The test corrupts 2/3 replicas.
> # client reports a bad block.
> # NN asks a DN to re-replicate, and randomly picks the other corrupt replica.
> # DN notices the incoming replica is corrupt and reports it as a bad block, 
> but does not inform the NN that re-replication failed.
> # NN keeps the block on pendingReplications.
> # BP scanner wakes up on both DNs with corrupt blocks, both report 
> corruption. NN reports both as duplicates, one from the client and one from 
> the DN report above.
> since block is on pendingReplications, NN does not schedule another 
> replication.
> Todd wrote:
> I can think of a few ways to fix this:
> ...
>  2) Add a field to the DN heartbeat which reports back a failed replication 
> for a given block. The NN would use this to decrement the pendingReplication 
> count, which would cause a new replication attempt to be made if it was still 
> under-replicated.
> This jira tracks implementing the DN heartbeat replication failure report.

