[ https://issues.apache.org/jira/browse/HDFS-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14115285#comment-14115285 ]

Kihwal Lee commented on HDFS-6964:
----------------------------------

More details about the case:
Because of node outages, there were a large number of under-replicated blocks. 
At the same time, space was not well balanced, so many nodes were full and
replication was slow due to HDFS-6965.  Many pending replications were timing
out and being placed back in the under-replicated block queue. These blocks were
quickly rescheduled for replication by the replication monitor.  The block that 
eventually ended up missing was not scheduled for replication at all during 
this time. The only remaining source node did not get any such replication 
command, and PendingReplicationMonitor never saw a replication of this block
expire.  The source node had been replicating other blocks without any
problem until the node was restarted hours later.

There are a few possibilities.
- The removal of the two nodes (heartbeat expiration) did not cause the block 
to be added to the under-replicated queue.
- The block was added to the queue, but was somehow removed without generating
any replication work.
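
A rough, self-contained sketch of the queue hand-off involved is below.  This
is not the actual namenode replication code; the class and method names are
made up for illustration.  The point it shows: a block that never lands in
either structure is invisible to both the replication monitor pass and the
pending-replication timeout pass, so nothing ever schedules it.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/**
 * Simplified model of the two queues: blocks needing replication wait in an
 * under-replicated queue; the monitor moves them to a pending map when work
 * is scheduled; a timeout pass moves expired entries back.  A block that is
 * in neither structure is never scheduled.
 */
public class ReplicationQueueModel {
    // Blocks known to need more replicas, waiting for the monitor to schedule them.
    private final Queue<Long> underReplicated = new ArrayDeque<>();
    // Blocks with replication work already handed out, keyed by block id,
    // value is the time the work was scheduled.
    private final Map<Long, Long> pending = new HashMap<>();
    private final long timeoutMs;

    ReplicationQueueModel(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a replica is lost; the block must land here to ever be fixed.
    void addUnderReplicated(long blockId) {
        underReplicated.add(blockId);
    }

    // Replication monitor pass: drain the needed queue and record the work as pending.
    void computeReplicationWork(long now) {
        Long blockId;
        while ((blockId = underReplicated.poll()) != null) {
            // The real monitor also picks a source and target and sends a command;
            // here we only record that work is outstanding.
            pending.put(blockId, now);
        }
    }

    // Pending-replication timeout pass: requeue anything that never completed.
    void processTimedOutPending(long now) {
        pending.entrySet().removeIf(e -> {
            if (now - e.getValue() > timeoutMs) {
                underReplicated.add(e.getKey());   // back to the needed queue
                return true;
            }
            return false;
        });
    }

    // A completion report from a datanode clears the pending entry.
    void replicationCompleted(long blockId) {
        pending.remove(blockId);
    }

    // True when neither pass can ever see the block -- the failure mode above.
    boolean isUntracked(long blockId) {
        return !underReplicated.contains(blockId) && !pending.containsKey(blockId);
    }
}

Under this model, either of the two possibilities above leaves the block in
the untracked state until something external, such as the full block report
after the datanode restart mentioned in the description, makes the NN
re-evaluate its replication.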

> NN fails to fix under replication leading to data loss
> ------------------------------------------------------
>
>                 Key: HDFS-6964
>                 URL: https://issues.apache.org/jira/browse/HDFS-6964
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Priority: Blocker
>
> We've encountered lost blocks due to node failure even when there is ample 
> time to fix the under-replication.
> 2 nodes were lost.  The 3rd node with the last remaining replicas averaged 1 
> block copy per heartbeat (3s) until ~7h later, when that node was lost, 
> resulting in over 50 lost blocks.  When the node was restarted and sent its 
> BR, the NN immediately began fixing the replication.
> In another data loss event, over 150 blocks were lost due to node failure, but 
> the timing of the node loss is not known, so there may have been inadequate 
> time to fix the under-replication, unlike the first case.


