[
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419823#comment-15419823
]
Colin P. McCabe commented on HDFS-10301:
----------------------------------------
I don't think the heartbeat is the right place to handle reconciling the block
storages. One reason is that it adds extra complexity and time to the
heartbeat, which happens far more frequently than an FBR. We even talked about
making the heartbeat lockless-- clearly you can't do that if you are traversing
all the block storages. Taking the FSN lock is expensive, and heartbeats are
sent quite frequently from each DN-- every few seconds. Another reason
reconciling storages in heartbeats is bad is that if the heartbeat tells you
about a new storage, you won't know what blocks are in it until the FBR
arrives. So the NN may end up assigning a bunch of new blocks to a storage
which looks empty, but is really full.
I came up with what I believe is the correct patch to fix this problem months
ago. It's here as
https://issues.apache.org/jira/secure/attachment/12805931/HDFS-10301.005.patch.
It doesn't modify any RPCs or add any new mechanisms. Instead, it just
fixes the obvious bug in the HDFS-7960 logic. The only counter-argument to
applying patch 005 that anyone ever came up with is that it doesn't eliminate
zombies when FBRs get interleaved. But this is not a good counter-argument,
since FBR interleaving is extremely, extremely rare in well-run clusters. The
proof should be obvious-- if FBR interleaving happened on more clusters, more
people would hit this serious data loss bug.
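To make the failure mode concrete, here is a rough sketch of the
blockReportId-based zombie detection that HDFS-7960 added, and of how an
interleaved retransmission trips it up. Names and structure are simplified
for illustration-- this is not the actual BlockManager/DatanodeStorageInfo
code, just the stamp-and-sweep idea:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class ZombieStorageSketch {

  static class Storage {
    final String id;
    long lastBlockReportId;  // stamped when a report section for this storage is processed
    Storage(String id) { this.id = id; }
  }

  // Process one per-storage section of a full block report: stamp the storage
  // with that report's id so the later sweep knows it was "seen".
  static void processStorageReport(Storage s, long blockReportId) {
    s.lastBlockReportId = blockReportId;
  }

  // After the last storage of a report is processed, any storage not stamped
  // with this report's id is assumed to be gone from the DN ("zombie") and its
  // replicas are removed.
  static List<Storage> findZombies(Collection<Storage> storages, long curBlockReportId) {
    List<Storage> zombies = new ArrayList<>();
    for (Storage s : storages) {
      if (s.lastBlockReportId != curBlockReportId) {
        zombies.add(s);
      }
    }
    return zombies;
  }

  public static void main(String[] args) {
    Storage s1 = new Storage("DS-1");
    Storage s2 = new Storage("DS-2");
    List<Storage> storages = Arrays.asList(s1, s2);

    long reportA = 100L;  // original FBR that the DN timed out on and resent
    long reportB = 200L;  // retransmitted FBR for the same DN

    // Interleaved processing of the two reports:
    processStorageReport(s1, reportA);  // original report, storage 1
    processStorageReport(s1, reportB);  // retransmission, storage 1 (overwrites stamp)
    processStorageReport(s2, reportB);  // retransmission finishes; its sweep sees all
                                        // storages stamped with reportB, nothing removed
    processStorageReport(s2, reportA);  // original report finishes late...

    // ...and its sweep runs with the stale reportA id: DS-1 is stamped with
    // reportB, so it is falsely declared a zombie and its replicas get removed.
    for (Storage z : findZombies(storages, reportA)) {
      System.out.println("falsely declared zombie: " + z.id);
    }
  }
}
{code}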
This JIRA has been extremely frustrating. It seems like most, if not all, of
the points that I brought up in my reviews were ignored. I talked about the
obvious problems with compatibility with [~shv]'s solution and even explicitly
asked him to test the upgrade case. I told him that this JIRA was a bad one to
give to a promising new contributor such as [~redvine], because it required a
lot of context and was extremely tricky. Both [~andrew.wang] and I
commented that overloading BlockListAsLongs was confusing and unnecessary.
The patch confused "not modifying the .proto file" with "not modifying the RPC
content", which are two very different concepts, as I commented over and over.
Clearly these comments were ignored. If anything, I think [~shv] got very
lucky that the bug manifested itself quickly rather than creating a serious
data loss situation a few months down the road, like the one I had to debug
when fixing HDFS-7960.
Again, I would urge you to just commit patch 005. Or at least evaluate it.
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Assignee: Vinitha Reddy Gankidi
> Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy a DataNode can time out sending a block report. Then
> it sends the block report again. The NameNode, while processing these two
> reports at the same time, can interleave processing of storages from different
> reports. This screws up the blockReportId field, which makes the NameNode think
> that some storages are zombies. Replicas from zombie storages are immediately
> removed, causing missing blocks.