[
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259529#comment-15259529
]
Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------
bq. Hey Colin, I reviewed your patch more thoroughly. There is still a problem
with interleaving reports. See updateBlockReportContext(). Suppose that block
reports interleave like this: <br1-s1, br2-s1, br1-s2, br2-s2>. Then br1-s2
will reset curBlockReportRpcsSeen, since curBlockReportId does not match the
id in that report; this discards the bit set for s1 by br2-s1, so the count of
rpcsSeen = 0 will be wrong for br2-s2. Possibly unreported (zombie) storages
will then not be removed. LMK if you see what I see.
Thanks for looking at the patch. I agree that in the case of interleaving,
zombie storages will not be removed. I don't consider that a problem, since we
will eventually get a non-interleaved full block report that will do the zombie
storage removal. In practice, interleaved block reports are extremely rare (we
have never seen the problem described in this JIRA, after deploying to
thousands of clusters).
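To make the scenario above concrete, here is a minimal sketch of the
clear-on-new-report-id bookkeeping being discussed (illustrative Java only,
not the actual NameNode code; the class name and signatures are made up):

{code:java}
import java.util.BitSet;

// Simplified stand-in for the per-DataNode state under discussion; not the
// real NameNode code, just the clear-on-new-report-id idea.
class ReportTracker {
  private long curBlockReportId = -1;
  private final BitSet curBlockReportRpcsSeen = new BitSet();

  // Called once per block-report RPC. A new report id resets the bit set,
  // which is what discards br2-s1's bit when br1-s2 arrives.
  void updateBlockReportContext(long reportId, int rpcIndex) {
    if (reportId != curBlockReportId) {
      curBlockReportId = reportId;
      curBlockReportRpcsSeen.clear();
    }
    curBlockReportRpcsSeen.set(rpcIndex);
  }

  int rpcsSeen() {
    return curBlockReportRpcsSeen.cardinality();
  }

  public static void main(String[] args) {
    ReportTracker t = new ReportTracker();
    // Interleaved order: <br1-s1, br2-s1, br1-s2, br2-s2>
    t.updateBlockReportContext(1, 0); // br1-s1
    t.updateBlockReportContext(2, 0); // br2-s1: resets, then marks RPC 0
    t.updateBlockReportContext(1, 1); // br1-s2: resets again; br2's bit is lost
    t.updateBlockReportContext(2, 1); // br2-s2: resets again; only RPC 1 marked
    // Prints 1, not 2, so br2 is never seen as a complete report.
    System.out.println("rpcsSeen for br2 = " + t.rpcsSeen());
  }
}
{code}

The count never reaches the number of RPCs in br2, so the interleaved report
is simply never treated as complete, which matches the outcome described
above: zombie removal is skipped rather than storages being falsely removed.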
bq. Maybe we should go with a different approach for this problem. A single
block report can be split into multiple RPCs. Within a single block-report RPC
the NameNode processes each storage under a lock, but then releases and
re-acquires the lock for the next storage, so that multiple RPC reports can
interleave due to multi-threading.
Maybe I'm misunderstanding the proposal, but don't we already do all of this?
We split block reports into multiple RPCs when the storage reports grow beyond
a certain size.
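For reference, a rough sketch of that splitting decision (simplified stand-in
types, nothing here is a Hadoop class; the real DataNode-side logic is driven
by a configurable threshold along the lines of dfs.blockreport.split.threshold):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: when the total number of blocks is small, all storages
// go into one RPC; above a threshold, each storage gets its own RPC, which is
// what makes interleaving on the NameNode possible at all.
class BlockReportSplitter {
  static class StorageBlockReport {
    final String storageId;
    final long numBlocks;
    StorageBlockReport(String storageId, long numBlocks) {
      this.storageId = storageId;
      this.numBlocks = numBlocks;
    }
  }

  /** Each element of the returned list is the payload of one RPC. */
  static List<List<StorageBlockReport>> planRpcs(
      List<StorageBlockReport> storages, long splitThreshold) {
    long totalBlocks = 0;
    for (StorageBlockReport s : storages) {
      totalBlocks += s.numBlocks;
    }
    List<List<StorageBlockReport>> rpcs = new ArrayList<>();
    if (totalBlocks <= splitThreshold) {
      rpcs.add(storages);                     // one RPC carrying every storage
    } else {
      for (StorageBlockReport s : storages) { // one RPC per storage
        rpcs.add(Collections.singletonList(s));
      }
    }
    return rpcs;
  }
}
{code}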
bq. Approach. The DN should report the full list of its storages in the first
block-report RPC. The NameNode first cleans up unreported storages and the
replicas belonging to them, then starts processing the rest of the block
reports as usual. So DataNodes explicitly report the storages that they have,
which eliminates the NameNode guessing which storage is the last in the block
report RPC.
What does the NameNode do if the DataNode is restarted while sending these
RPCs, so that it never gets a chance to send all the storages that it claimed
existed? It seems like you will get stuck and not be able to accept any new
reports. Or, you can take the same approach the current patch does, and clear
the current state every time you see a new ID (but then you can't do zombie
storage elimination in the presence of interleaving).
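To make that concrete, here is a hypothetical sketch of the NameNode-side
bookkeeping the proposal seems to imply (none of this exists in HDFS; class
and method names are invented): either the NameNode keeps waiting for every
declared storage, which can wedge on a DataNode restart, or it drops the old
declaration whenever a new report id shows up, which brings back the
interleaving limitation.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed protocol; not existing HDFS code.
class DeclaredStorageTracker {
  private long activeReportId = -1;
  private final Set<String> declaredStorages = new HashSet<>();
  private final Set<String> reportedSoFar = new HashSet<>();

  // First RPC of a full block report: the DataNode declares every storage it
  // has, and the NameNode prunes anything it tracks that was not declared.
  void startReport(long reportId, Set<String> declared,
                   Set<String> storagesKnownToNameNode) {
    activeReportId = reportId;
    declaredStorages.clear();
    declaredStorages.addAll(declared);
    reportedSoFar.clear();
    storagesKnownToNameNode.removeIf(s -> !declared.contains(s));
  }

  // Later RPCs of the same report. If the DataNode restarted and began a new
  // report before finishing this one, we must either keep waiting for storages
  // that will never arrive or drop the old declaration -- the dilemma above.
  void storageReported(long reportId, String storageId) {
    if (reportId == activeReportId) {
      reportedSoFar.add(storageId);
    }
  }

  boolean reportComplete() {
    return reportedSoFar.containsAll(declaredStorages);
  }
}
{code}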
One approach that avoids all these problems is to skip zombie storage
elimination during FBRs entirely and do it during DN heartbeats instead (for
example). DN heartbeats are small messages that are never split, and their
processing is not interleaved with anything.
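A sketch of what that could look like (again purely illustrative, not existing
HDFS code, and assuming the heartbeat carries the full list of the DataNode's
storages, as the per-storage heartbeat reports already allow):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustration only, not existing HDFS code: a heartbeat is one small RPC that
// lists all of the DataNode's storages, so the NameNode can prune its view of
// that DataNode under a single lock with no interleaving to worry about.
class HeartbeatZombiePruner {
  // NameNode's view of one DataNode: storageId -> tracked replica count.
  private final Map<String, Long> knownStorages = new HashMap<>();

  synchronized void recordStorage(String storageId, long replicaCount) {
    knownStorages.put(storageId, replicaCount);
  }

  // Called while processing one heartbeat; storagesInHeartbeat is the complete
  // set of storages the DataNode just reported.
  synchronized void pruneZombies(Set<String> storagesInHeartbeat) {
    knownStorages.keySet().removeIf(id -> !storagesInHeartbeat.contains(id));
  }

  synchronized Set<String> storages() {
    return knownStorages.keySet();
  }
}
{code}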
I hope we agree that the current patch solves the problem of storages falsely
being declared as zombies. I think that's a good enough reason to get this
patch in, and then think about alternate approaches later.
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out while sending a block
> report, so it sends the block report again. The NameNode, processing these
> two reports at the same time, can interleave processing of storages from
> different reports. This screws up the blockReportId field, which makes the
> NameNode think that some storages are zombies. Replicas from zombie storages
> are immediately removed, causing missing blocks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)