[jira] [Commented] (HDFS-15421) IBR leak causes standby NN to be stuck in safe mode

Konstantin Shvachko (Jira) Thu, 25 Jun 2020 11:53:55 -0700


    [ https://issues.apache.org/jira/browse/HDFS-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145743#comment-17145743 ]


Konstantin Shvachko commented on HDFS-15421:
--------------------------------------------

Great collaboration here guys. Did some digging in the code.
# {{internalReleaseLease()}} can trigger three transactions: {{OP_CLOSE}}, 
{{OP_SET_GENSTAMP}}, {{OP_REASSIGN_LEASE}}. The first two already handle the 
genStamp correctly with the patch. The last one does not carry a new genStamp.
# I think adding {{applyImpendingGenerationStamp()}} in {{OP_REASSIGN_LEASE}} 
is incorrect as it restores the race condition of HDFS-14941. And the comment 
is confusing: even though the two transactions are added to edits under the 
common lock, their execution on SBN happens outside the lock and is not atomic.
# Found one more place, {{FSEditLogLoader.addNewBlock()}}, where we need to add 
{{setGenerationStampIfGreater()}}. {{addNewBlock()}} adds a block with a new 
genStamp.
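
To make the "if greater" semantics concrete, here is a minimal sketch of a monotonic stamp update. {{GenStampTracker}} and {{setIfGreater()}} are hypothetical names for illustration only. This is not the actual {{setGenerationStampIfGreater()}} code, just the behavior I would expect from it: the global stamp can move forward but never backward, regardless of the order in which ops are applied.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy stand-in for the global generation stamp; an assumption, not Hadoop code. */
final class GenStampTracker {
    private final AtomicLong current = new AtomicLong();

    GenStampTracker(long initial) {
        current.set(initial);
    }

    /** Raises the stamp to candidate only if candidate is larger; returns the resulting stamp. */
    long setIfGreater(long candidate) {
        return current.accumulateAndGet(candidate, Math::max);
    }

    long get() {
        return current.get();
    }
}
```

With this shape, replaying an op that carries an older genStamp is a no-op, so calling it from several opcodes is harmless.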

Here is the list of all operations that can add a new genStamp. LMK if I missed 
any:
# OP_ADD
# OP_ADD_BLOCK
# OP_UPDATE_BLOCKS
# OP_SET_GENSTAMP
# OP_CLOSE
# OP_TRUNCATE

I think all of them except OP_ADD_BLOCK use {{setGenerationStampIfGreater()}} 
with the latest patch. Worth double-checking, of course.
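
A toy model of why one unhandled opcode matters, assuming a report counts as "from future" when its genStamp exceeds the global one. {{ReplaySketch}}, {{Edit}}, {{replay()}} and {{fromFuture()}} are made-up names and the stamp values are arbitrary. This is not {{FSEditLogLoader}}, only a sketch of the reasoning.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ReplaySketch {
    /** The six opcodes from the list above. */
    enum Op { OP_ADD, OP_ADD_BLOCK, OP_UPDATE_BLOCKS, OP_SET_GENSTAMP, OP_CLOSE, OP_TRUNCATE }

    /** Hypothetical edit record: an opcode plus the genStamp it carries. */
    static final class Edit {
        final Op op;
        final long genStamp;

        Edit(Op op, long genStamp) {
            this.op = op;
            this.genStamp = genStamp;
        }
    }

    /** Replays edits; only opcodes in 'handled' raise the global stamp. */
    static long replay(long start, List<Edit> edits, Set<Op> handled) {
        long global = start;
        for (Edit e : edits) {
            if (handled.contains(e.op)) {
                global = Math.max(global, e.genStamp);
            }
        }
        return global;
    }

    /** Assumed rule: a report is "from future" if its genStamp exceeds the global one. */
    static boolean fromFuture(long reportedGenStamp, long global) {
        return reportedGenStamp > global;
    }

    public static void main(String[] args) {
        List<Edit> edits = List.of(new Edit(Op.OP_ADD, 1001L), new Edit(Op.OP_ADD_BLOCK, 1002L));
        Set<Op> withoutAddBlock = EnumSet.complementOf(EnumSet.of(Op.OP_ADD_BLOCK));

        long lagging = replay(1000L, edits, withoutAddBlock);
        long caughtUp = replay(1000L, edits, EnumSet.allOf(Op.class));

        System.out.println(fromFuture(1002L, lagging));  // true: the report would be re-queued
        System.out.println(fromFuture(1002L, caughtUp)); // false: the report can be processed
    }
}
```

In this model, skipping OP_ADD_BLOCK leaves the global stamp one behind, so a report for the new block keeps looking like it is from the future.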

> IBR leak causes standby NN to be stuck in safe mode
> ---------------------------------------------------
>
>                 Key: HDFS-15421
>                 URL: https://issues.apache.org/jira/browse/HDFS-15421
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Kihwal Lee
>            Assignee: Akira Ajisaka
>            Priority: Blocker
>              Labels: release-blocker
>         Attachments: HDFS-15421-000.patch, HDFS-15421-001.patch, 
> HDFS-15421.002.patch, HDFS-15421.003.patch, HDFS-15421.004.patch, 
> HDFS-15421.005.patch, HDFS-15421.006.patch, HDFS-15421.007.patch
>
>
> After HDFS-14941, update of the global gen stamp is delayed in certain 
> situations.  This makes the last set of incremental block reports from append 
> "from future", which causes it to be simply re-queued to the pending DN 
> message queue, rather than processed to complete the block.  The last set of 
> IBRs will leak and never be cleaned up until the standby NN transitions to 
> active.  The size of {{pendingDNMessages}} constantly grows until then.
> If a leak happens while in a startup safe mode, the namenode will never be 
> able to come out of safe mode on its own.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

