[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252612#comment-15252612
 ] 

Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

I have posted a new patch as HDFS-10301.002.patch.  The idea here is that we 
know the number of storage reports we expect to see in the block report.  We 
should not remove any storages as zombies unless we have seen that number of 
storages and marked each of them with the ID of the latest block report.

I feel that this approach is better than the one used in 001.patch, since it 
correctly handles the "interleaved" case.  It is very difficult to prove that 
we can never get interleaved storage reports for the DataNode.  This is because 
of issues like queuing inside the RPC system, packets getting reordered or 
delayed by the network, and queuing inside the deferred work mechanism added by 
HDFS-9198.  So we should handle this case correctly.
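
To make the check described above concrete, here is a minimal sketch in Java.  
It is not the code in HDFS-10301.002.patch: the class ZombieStorageSketch and 
the methods processStorageReport and findZombies are hypothetical names made 
up for illustration, and a plain map stands in for the NameNode's per-storage 
state.  It only shows the rule "do not remove zombies unless the expected 
number of storages carry the latest block report ID".

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; names and structure are assumptions. */
class ZombieStorageSketch {
    // Storage ID -> ID of the last block report that touched this storage.
    private final Map<String, Long> lastReportIdByStorage = new HashMap<>();

    /** Record one per-storage report that arrived as part of block report reportId. */
    void processStorageReport(String storageId, long reportId) {
        lastReportIdByStorage.put(storageId, reportId);
    }

    /**
     * Return the storages that may be removed as zombies. The result is empty
     * unless at least expectedStorages storages are marked with latestReportId,
     * so an incomplete or interleaved report never triggers removal.
     */
    Set<String> findZombies(long latestReportId, int expectedStorages) {
        int marked = 0;
        for (long id : lastReportIdByStorage.values()) {
            if (id == latestReportId) {
                marked++;
            }
        }
        Set<String> zombies = new HashSet<>();
        if (marked < expectedStorages) {
            return zombies; // not enough evidence; remove nothing
        }
        for (Map.Entry<String, Long> e : lastReportIdByStorage.entrySet()) {
            if (e.getValue() != latestReportId) {
                zombies.add(e.getKey());
            }
        }
        return zombies;
    }
}
```

In the interleaved case, suppose a timed-out report (ID 1) and its retry 
(ID 2) each cover storages s1 and s2, and the per-storage reports are 
processed in the order s1/1, s1/2, s2/2, s2/1.  Afterwards s1 is marked with 
2 and s2 with 1, so neither ID reaches the expected count of two and 
findZombies removes nothing.  A check that only compares each storage against 
the current report ID would wrongly flag one of them.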

> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.01.patch, 
> zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. It 
> then sends the block report again. While processing these two reports at the 
> same time, the NameNode can interleave storages from the different reports. 
> This screws up the blockReportId field, which makes the NameNode think that 
> some storages are zombies. Replicas from zombie storages are immediately 
> removed, causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
