[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253236#comment-15253236 ]
Walter Su commented on HDFS-10301: ---------------------------------- I like your idea of counting storages with same reportId, and no purge if there's any interleaving. I guest {{rpcsSeen}} can be removed or replaced by {{storagesSeen}}? Processing the retransmissioned reports is kind of wasting resource. I think the best approach is as Colin said, "to remove existing DataNode storage report RPCs with the old ID from the queue when we receive one with a new block report ID." Let's consider it as an optimization in another jira. > BlockReport retransmissions may lead to storages falsely being declared > zombie if storage report processing happens out of order > -------------------------------------------------------------------------------------------------------------------------------- > > Key: HDFS-10301 > URL: https://issues.apache.org/jira/browse/HDFS-10301 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode > Affects Versions: 2.6.1 > Reporter: Konstantin Shvachko > Priority: Critical > Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, > HDFS-10301.01.patch, zombieStorageLogs.rtf > > > When NameNode is busy a DataNode can timeout sending a block report. Then it > sends the block report again. Then NameNode while process these two reports > at the same time can interleave processing storages from different reports. > This screws up the blockReportId field, which makes NameNode think that some > storages are zombie. Replicas from zombie storages are immediately removed, > causing missing blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)