[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253181#comment-15253181
 ] 

Walter Su commented on HDFS-10301:
----------------------------------

bq. Enabling HDFS-9198 will fifo process BRs. It doesn't solve this 
implementation bug but virtually eliminates it from occurring.
bq. This addresses Daryn's comment rather than solving the reported bug, as BTW 
Daryn correctly stated.
That's incorrect. Please run the test in the 001 patch with and without the fix, 
and you'll see the difference. It does solve the issue, because:

The bug only exists when the reports are contained in one RPC. If they are split 
into multiple RPCs, it's not a problem, because the {{rpcsSeen}} guard prevents it 
from happening. So, my approach is to process the reports contained in one RPC 
contiguously, by putting them into the queue atomically.
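
A minimal sketch of the "enqueue atomically, process contiguously" idea, assuming a single consumer thread drains the queue. The class and method names ({{BatchedReportQueue}}, {{RpcBatch}}, {{enqueueRpc}}, {{drainAndProcess}}) are hypothetical illustrations and are not taken from the patch or from the HDFS code base:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; none of these names come from the HDFS code base.
class BatchedReportQueue {
    // One queue element holds every storage report carried by a single RPC.
    static final class RpcBatch {
        final long blockReportId;
        final List<String> storageIds;

        RpcBatch(long blockReportId, List<String> storageIds) {
            this.blockReportId = blockReportId;
            this.storageIds = new ArrayList<>(storageIds);
        }
    }

    private final BlockingQueue<RpcBatch> queue = new LinkedBlockingQueue<>();

    // A single add() makes the whole RPC visible to the consumer at once,
    // so reports from another RPC cannot be queued in between.
    void enqueueRpc(long blockReportId, List<String> storageIds) {
        queue.add(new RpcBatch(blockReportId, storageIds));
    }

    // Single consumer: drains the queue and processes each batch contiguously.
    // Returns the processing order as "blockReportId:storageId" entries.
    List<String> drainAndProcess() {
        List<String> order = new ArrayList<>();
        RpcBatch batch;
        while ((batch = queue.poll()) != null) {
            for (String storageId : batch.storageIds) {
                order.add(batch.blockReportId + ":" + storageId);
            }
        }
        return order;
    }
}
```

With this shape, a retransmitted report (a second RPC with a different blockReportId) is handled only after every storage of the first RPC has been handled, so storages from the two reports are never interleaved.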


> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. It 
> then sends the block report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing storages from different 
> reports. This screws up the blockReportId field, which makes the NameNode think 
> that some storages are zombies. Replicas from zombie storages are immediately 
> removed, causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
