[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256791#comment-15256791 ]

Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

bq. [~shv] wrote: The last line is confusing, because it should have been 2, 
but it is 0 since br2 overrode lastBlockReportId for s1 and s2.

It's OK for it to be 0 here.  It just means that we will not do the zombie 
storage elimination for these particular full block reports.  Remember that 
interleaved block reports are an extremely rare case, and so are zombie 
storages.  We can wait for the next FBR to do the zombie elimination.
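
To sketch why deferring is safe (a hypothetical illustration only; the class and 
method names below are made up and are not the actual BlockManager or 
DatanodeDescriptor code in the patch): if any storage's recorded report ID was 
overwritten by an interleaved retransmission, the NameNode can simply skip 
pruning for this report and let the next full block report do the elimination.
{code}
// Hypothetical sketch only; these names are illustrative stand-ins, not the
// real DatanodeDescriptor / DatanodeStorageInfo APIs.
class StorageState {
  final String storageId;
  long lastBlockReportId;   // 0 means the bookkeeping for this FBR was lost
  StorageState(String id) { this.storageId = id; }
}

class ZombiePruneSketch {
  static void maybePruneZombies(Iterable<StorageState> storages,
                                long curBlockReportId) {
    for (StorageState s : storages) {
      if (s.lastBlockReportId == 0) {
        // An interleaved retransmission reset the bookkeeping; skip zombie
        // elimination for this FBR and let the next full block report prune
        // any genuinely stale storages.
        return;
      }
    }
    for (StorageState s : storages) {
      if (s.lastBlockReportId != curBlockReportId) {
        System.out.println("would prune zombie storage " + s.storageId);
      }
    }
  }
}
{code}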

bq. I think this could be a simple fix for this jira, and we can discuss other 
approaches to zombie storage detection in the next issue. Current approach 
seems to be error prone. One way is to go with the retry cache as Jing Zhao 
suggested. Or there could be other ideas.

The problem with a retry cache is that it uses up memory.  We don't have an 
easy way to put an upper bound on the amount of memory we would need, short of 
adding complex logic to limit the number of full block reports accepted from a 
specific DataNode in a given time period.
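
For context, a retry cache here would mean remembering recently seen block 
report IDs per DataNode, so that a retransmission could be recognized instead 
of being processed as a new report.  Bounding its memory takes exactly the kind 
of extra bookkeeping mentioned above; a rough, hypothetical sketch (not an 
existing Hadoop class):
{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a per-DataNode retry cache for block report IDs.
// Nothing like this exists in the current code; it only illustrates the
// bookkeeping (and the memory bound) such an approach would need.
class BlockReportRetryCache {
  private static final int MAX_IDS_PER_DATANODE = 16;   // arbitrary cap

  // The outer map still grows with the number of DataNodes in the cluster.
  private final Map<String, LinkedHashMap<Long, Boolean>> seen =
      new LinkedHashMap<>();

  /** Returns true if this (DataNode, reportId) pair was already processed. */
  synchronized boolean isDuplicate(String datanodeUuid, long blockReportId) {
    LinkedHashMap<Long, Boolean> ids = seen.computeIfAbsent(datanodeUuid,
        k -> new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<Long, Boolean> e) {
            return size() > MAX_IDS_PER_DATANODE;    // evict the oldest IDs
          }
        });
    return ids.put(blockReportId, Boolean.TRUE) != null;
  }
}
{code}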

bq. This brought me to an idea. BR ids are monotonically increasing...

The code for generating block report IDs is here:
{code}
  private long generateUniqueBlockReportId() {
    // Initialize the block report ID the first time through.
    // Note that 0 is used on the NN to indicate "uninitialized", so we should
    // not send a 0 value ourselves.
    prevBlockReportId++;
    while (prevBlockReportId == 0) {
      prevBlockReportId = ThreadLocalRandom.current().nextLong();
    }
    return prevBlockReportId;
  }
{code}

It's not monotonically increasing in the case where rollover occurs.  While 
this is an extremely rare case, the consequences of getting it wrong would be 
extremely severe.  So an approach that relies on monotonically increasing IDs 
might be possible as an incompatible change, but not as a change in branch-2.
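
To make the rollover concern concrete (a minimal standalone illustration, not 
code from the patch): a signed 64-bit increment wraps from Long.MAX_VALUE to 
Long.MIN_VALUE, and the re-seed on 0 jumps to an arbitrary random value, so 
nothing guarantees the IDs keep increasing.
{code}
// Minimal illustration (not patch code) of why the generated IDs are not
// strictly monotonic: a signed 64-bit increment wraps around, and the
// re-seed on 0 jumps to an arbitrary random value.
public class BlockReportIdRollover {
  public static void main(String[] args) {
    long prev = Long.MAX_VALUE;
    long next = prev + 1;                    // wraps to Long.MIN_VALUE
    System.out.println(next > prev);         // false: ordering is broken

    long beforeZero = -1L;
    long wrapped = beforeZero + 1;           // lands exactly on 0
    // generateUniqueBlockReportId() would then re-seed with
    // ThreadLocalRandom.nextLong(), which may be smaller than any
    // previously sent ID.
    System.out.println(wrapped == 0);        // true
  }
}
{code}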

bq. [~walter.k.su] wrote: If a BR is split into multiple RPCs, there's no 
interleaving naturally, because the DN gets the ack before it sends the next 
RPC. Interleaving only exists if the BR is not split. I agree the bug needs to 
be fixed from the inside; it's just that eliminating interleaving for good may 
not be a bad idea, as it simplifies the problem and is also a simple workaround 
for this jira.

We don't document anywhere that interleaving doesn't occur.  We don't have unit 
tests verifying that it doesn't occur, and if we did, those unit tests might 
accidentally pass because of race conditions.  Even if we eliminated 
interleaving for now, anyone changing the RPC code or the queuing code could 
easily re-introduce it, and this bug would come back.  That's why I agree with 
[~shv] -- we should not focus on trying to remove interleaving.

bq. [~shv] wrote: I think this could be a simple fix for this jira, and we can 
discuss other approaches to zombie storage detection in the next issue.

Yeah, let's get this fix in and then talk about potential improvements in a 
follow-on jira.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. Then 
> it sends the block report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from different 
> reports. This screws up the blockReportId field, which makes the NameNode 
> think that some storages are zombie. Replicas from zombie storages are 
> immediately removed, causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
