[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259529#comment-15259529
 ] 

Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

bq. Hey Colin, I reviewed your patch more thoroughly. There is still a problem 
with interleaving reports. See updateBlockReportContext(). Suppose that block 
reports interleave like this: <br1-s1, br2-s1, br1-s2, br2-s2>. Then br1-s2 
will reset curBlockReportRpcsSeen since curBlockReportId is not the same as in 
the report, which will discard the bit set for s1 in br2-s1, and the count of 
rpcsSeen = 0 will be wrong for br2-s2. So possibly unreported (zombie) storages 
will not be removed. LMK if you see what I see.

Thanks for looking at the patch.  I agree that in the case of interleaving, 
zombie storages will not be removed.  I don't consider that a problem, since we 
will eventually get a non-interleaved full block report that will do the zombie 
storage removal.  In practice, interleaved block reports are extremely rare (we 
have never seen the problem described in this JIRA, after deploying to 
thousands of clusters).
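
To make the interleaving concrete, here is a toy model of the per-DataNode report-tracking state described above (plain Python, not the actual NameNode code; the class and field names loosely mirror updateBlockReportContext and curBlockReportRpcsSeen but are illustrative):

```python
# Toy model of the NameNode's per-DataNode block report context.  A full
# block report may be split into several RPCs; the NameNode remembers the
# current report ID and which RPC indices of that report it has seen, and
# resets that state whenever an RPC with a different report ID arrives.

class BlockReportContext:
    """One RPC of a (possibly multi-RPC) full block report."""
    def __init__(self, report_id, total_rpcs, rpc_index):
        self.report_id = report_id
        self.total_rpcs = total_rpcs
        self.rpc_index = rpc_index

class DatanodeState:
    """State the NameNode keeps while a DataNode's report is in flight."""
    def __init__(self):
        self.cur_block_report_id = None
        self.cur_rpcs_seen = set()   # stands in for the RPC bitset

    def update_block_report_context(self, ctx):
        # A new report ID resets the bitset -- this is the step that
        # discards br2's earlier RPC when two reports interleave.
        if ctx.report_id != self.cur_block_report_id:
            self.cur_block_report_id = ctx.report_id
            self.cur_rpcs_seen = set()
        self.cur_rpcs_seen.add(ctx.rpc_index)
        # Zombie elimination may only run once every RPC of the report
        # has been seen.
        return len(self.cur_rpcs_seen) == ctx.total_rpcs

dn = DatanodeState()
# The interleaving <br1-s1, br2-s1, br1-s2, br2-s2>, two RPCs per report:
order = [("br1", 0), ("br2", 0), ("br1", 1), ("br2", 1)]
results = [dn.update_block_report_context(BlockReportContext(rid, 2, idx))
           for rid, idx in order]
# results == [False, False, False, False]: no report is ever seen as
# complete, so zombie elimination is skipped -- but nothing is falsely
# removed either.  A non-interleaved report does complete normally.
```

This is exactly the trade-off described above: under interleaving the zombie check is skipped (safe, and a later clean report will catch up), rather than run with an undercounted rpcsSeen.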

bq. Maybe we should go with a different approach for this problem.  Single 
block report can be split into multiple RPCs. Within single block-report-RPC 
NameNode processes each storage under a lock, but then releases and re-acquires 
the lock for the next storage, so that multiple RPC reports can interleave due 
to multi-threading.

Maybe I'm misunderstanding the proposal, but don't we already do all of this?  
We split block reports into multiple RPCs when the storage reports grow beyond 
a certain size.

bq. Approach. The DN should report the full list of its storages in the first 
block-report-RPC. The NameNode first cleans up unreported storages and the 
replicas belonging to them, then starts processing the rest of the block report 
as usual. So DataNodes explicitly report the storages that they have, which 
eliminates the NameNode guessing which storage is the last in the block report 
RPC.

What does the NameNode do if the DataNode is restarted while sending these 
RPCs, so that it never gets a chance to send all the storages that it claimed 
existed?  It seems like you will get stuck and not be able to accept any new 
reports.  Or, you can take the same approach the current patch does, and clear 
the current state every time you see a new ID (but then you can't do zombie 
storage elimination in the presence of interleaving.)
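
A rough sketch of that proposal, continuing the toy model above (hypothetical, not code from the patch; all names are illustrative), shows where the restart hazard lives:

```python
# Hypothetical sketch of the "declare storages up front" approach: the
# first RPC of a report lists every storage the DN has, the NameNode
# prunes the rest immediately, then expects exactly those storages in
# the remaining RPCs of the same report.

class DeclaredReportState:
    def __init__(self):
        self.report_id = None
        self.expected = set()    # storages declared in the first RPC
        self.received = set()

    def first_rpc(self, report_id, declared_storages, known_storages):
        # Zombie cleanup: drop storages the DN no longer declares.
        zombies = known_storages - set(declared_storages)
        self.report_id = report_id
        self.expected = set(declared_storages)
        self.received = set()
        return zombies

    def storage_rpc(self, report_id, storage_id):
        if report_id != self.report_id:
            # A DN restarted mid-report sends a fresh report ID.  Either
            # we clear state here (losing the completeness guarantee, as
            # with the current patch under interleaving), or we reject it
            # and stall waiting for storages that will never arrive.
            raise ValueError("unexpected report ID mid-report")
        self.received.add(storage_id)
        return self.received == self.expected
```

The comment in storage_rpc is the fork in the road described above: neither branch is free.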

One approach that avoids all these problems is to avoid doing zombie storage 
elimination during FBRs entirely, and do it instead during DN heartbeats (for 
example).  DN heartbeats are small messages that are never split, and their 
processing is not interleaved with anything.
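
The heartbeat-based alternative would reduce to a simple set difference (again a hypothetical sketch, not the patch; it assumes heartbeats carry the DN's full list of storage IDs, which they can since heartbeats are small single-RPC messages):

```python
# Sketch of heartbeat-driven zombie elimination: compare the storages
# the NameNode tracks for a DataNode against the storage IDs listed in
# its latest heartbeat, and drop the ones no longer reported.

def prune_zombie_storages(known_storages, heartbeat_storage_ids):
    """Return (surviving, removed) storage-ID sets."""
    reported = set(heartbeat_storage_ids)
    removed = {s for s in known_storages if s not in reported}
    return known_storages - removed, removed

# Example: the DN stopped reporting DS-2, so it is pruned.
surviving, removed = prune_zombie_storages(
    {"DS-1", "DS-2", "DS-3"}, ["DS-1", "DS-3"])
# surviving == {"DS-1", "DS-3"}, removed == {"DS-2"}
```

Because a heartbeat arrives and is processed as one unit, there is no report ID, no bitset, and no interleaving to reason about.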

I hope we agree that the current patch solves the problem of storages falsely 
being declared as zombies.  I think that's a good enough reason to get this 
patch in, and then think about alternate approaches later.

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out while sending a block 
> report, and then it sends the block report again. The NameNode, while 
> processing these two reports at the same time, can interleave processing of 
> storages from the different reports. This screws up the blockReportId field, 
> which makes the NameNode think that some storages are zombies. Replicas from 
> zombie storages are immediately removed, causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
