[
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419823#comment-15419823
]
Colin P. McCabe commented on HDFS-10301:
----------------------------------------
I don't think the heartbeat is the right place to handle reconciling the block
storages. One reason is that it adds extra complexity and time to the
heartbeat, which happens far more frequently than an FBR. We even talked about
making the heartbeat lockless-- clearly you can't do that if you are traversing
all the block storages. Taking the FSN lock is expensive, and heartbeats are
sent quite frequently from each DN-- every few seconds. Another reason
reconciling storages in heartbeats is bad is that if the heartbeat tells you
about a new storage, you won't know what blocks are in it until the FBR
arrives. So the NN may end up assigning a bunch of new blocks to a storage
which looks empty, but is really full.
I came up with what I believe is the correct patch to fix this problem months
ago. It's here as
https://issues.apache.org/jira/secure/attachment/12805931/HDFS-10301.005.patch.
It doesn't modify any RPCs or add any new mechanisms. Instead, it just
fixes the obvious bug in the HDFS-7960 logic. The only counter-argument to
applying patch 005 that anyone ever came up with is that it doesn't eliminate
zombies when FBRs get interleaved. But this is not a good counter-argument,
since FBR interleaving is extremely, extremely rare in well-run clusters. The
proof should be obvious-- if FBR interleaving happened on more clusters, more
people would hit this serious data loss bug.
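To make the failure mode concrete, here is a rough sketch of the
blockReportId-based zombie detection that HDFS-7960 added, and of how an
interleaved retransmission trips it up. Names and structure are simplified
for illustration-- this is not the actual BlockManager/DatanodeStorageInfo
code, just the stamp-and-sweep idea:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class ZombieStorageSketch {

  static class Storage {
    final String id;
    long lastBlockReportId;  // stamped when a report section for this storage is processed
    Storage(String id) { this.id = id; }
  }

  // Process one per-storage section of a full block report: stamp the storage
  // with that report's id so the later sweep knows it was "seen".
  static void processStorageReport(Storage s, long blockReportId) {
    s.lastBlockReportId = blockReportId;
  }

  // After the last storage of a report is processed, any storage not stamped
  // with this report's id is assumed to be gone from the DN ("zombie") and its
  // replicas are removed.
  static List<Storage> findZombies(Collection<Storage> storages, long curBlockReportId) {
    List<Storage> zombies = new ArrayList<>();
    for (Storage s : storages) {
      if (s.lastBlockReportId != curBlockReportId) {
        zombies.add(s);
      }
    }
    return zombies;
  }

  public static void main(String[] args) {
    Storage s1 = new Storage("DS-1");
    Storage s2 = new Storage("DS-2");
    List<Storage> storages = Arrays.asList(s1, s2);

    long reportA = 100L;  // original FBR that the DN timed out on and resent
    long reportB = 200L;  // retransmitted FBR for the same DN

    // Interleaved processing of the two reports:
    processStorageReport(s1, reportA);  // original report, storage 1
    processStorageReport(s1, reportB);  // retransmission, storage 1 (overwrites stamp)
    processStorageReport(s2, reportB);  // retransmission finishes; its sweep sees all
                                        // storages stamped with reportB, nothing removed
    processStorageReport(s2, reportA);  // original report finishes late...

    // ...and its sweep runs with the stale reportA id: DS-1 is stamped with
    // reportB, so it is falsely declared a zombie and its replicas get removed.
    for (Storage z : findZombies(storages, reportA)) {
      System.out.println("falsely declared zombie: " + z.id);
    }
  }
}
{code}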
This JIRA has been extremely frustrating. It seems like most, if not all, of
the points that I brought up in my reviews were ignored. I talked about the
obvious problems with compatibility with [~shv]'s solution and even explicitly
asked him to test the upgrade case. I told him that this JIRA was a bad one to
give to a promising new contributor such as [~redvine], because it required a
lot of context and was extremely tricky. Both [~andrew.wang] and I
commented that overloading BlockListAsLongs was confusing and unnecessary.
The patch confused "not modifying the .proto file" with "not modifying the RPC
content", which are two very different concepts, as I commented over and over.
Clearly these comments were ignored. If anything, I think [~shv] got very
lucky that the bug manifested itself quickly rather than creating a serious
data loss situation a few months down the road, like the one I had to debug
when fixing HDFS-7960.
Again, I would urge you to just commit patch 005. Or at least evaluate it.
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Assignee: Vinitha Reddy Gankidi
> Priority: Critical
> Fix For: 2.7.4
>
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy a DataNode can time out sending a block report. Then
> it sends the block report again. The NameNode, while processing these two
> reports at the same time, can interleave processing of storages from different
> reports. This screws up the blockReportId field, which makes the NameNode think
> that some storages are zombies. Replicas from zombie storages are immediately
> removed, causing missing blocks.