[jira] [Comment Edited] (HDFS-15761) Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately

Ye Ni (Jira) Mon, 04 Jan 2021 11:46:07 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258452#comment-17258452
 ]


Ye Ni edited comment on HDFS-15761 at 1/4/21, 7:45 PM:
-------------------------------------------------------

cc [~mingma], [~andrew.wang], [~zhz] , [~inigoiri]


was (Author: nickyye):
cc [~mingma], [~andrew.wang], [~aiden_zhang], [~inigoiri]

> Dead NORMAL DN shouldn't transit to DECOMMISSIONED immediately
> --------------------------------------------------------------
>
>                 Key: HDFS-15761
>                 URL: https://issues.apache.org/jira/browse/HDFS-15761
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ye Ni
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> To decommission a dead DN, the complete logic should be
> Dead, NORMAL -> Dead, DECOMMISSION_INPROGRESS -> Dead, DECOMMISSIONED
> *Current logic:*
> If a DN is already dead when decommissioning starts, it becomes 
> DECOMMISSIONED immediately. DECOMMISSION_INPROGRESS is skipped.
> This logic was introduced by https://issues.apache.org/jira/browse/HDFS-7374
> HDFS-7374 was made because of https://issues.apache.org/jira/browse/HDFS-6791.
> HDFS-6791 keeps a node in the DECOMMISSION_INPROGRESS state if it becomes 
> dead during decommission, which could leave a dead DN in 
> DECOMMISSION_INPROGRESS forever if the DN never comes back alive.
> However, moving a dead DN to DECOMMISSIONED directly is not safe. For 
> example, suppose the 3 DNs holding the replicas of the same block are dead 
> at the same time and the administrator then decommissions them. The Namenode 
> should check first before transitioning them to DECOMMISSIONED. Otherwise, 
> data would be lost.
> In this case, none of the 3 DNs can become DECOMMISSIONED, which is by 
> design. The administrator needs to intervene manually, either repairing the 
> dead machine or service or recovering the data, before decommissioning them.
> This change adds the Dead, DECOMMISSION_INPROGRESS state back.
> 1. A dead NORMAL DN first enters DECOMMISSION_INPROGRESS.
> 2. The Namenode then checks that pendingReplicationBlocksCount and 
> underReplicatedBlocksCount are both 0.
> 3. The dead DN transitions to DECOMMISSIONED.
> Step 2 is implemented by https://issues.apache.org/jira/browse/HDFS-7409, 
> which adds a check that allows dead nodes in DECOMMISSION_INPROGRESS to 
> progress to the DECOMMISSIONED state if all files on the filesystem are 
> fully replicated. With this change, a dead DN is placed in 
> DECOMMISSION_INPROGRESS and must pass that check before it becomes 
> DECOMMISSIONED.
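
The three steps above can be sketched as a small state machine. This is an
illustrative model only, not the Hadoop implementation: the class
DeadNodeDecommissionSketch, the method nextState, and its two counter
parameters are hypothetical names chosen for this example. Only the state
names and the two counters being checked come from the description.

```java
import java.util.Objects;

/** Hypothetical model of the proposed admin-state transitions for a dead DN. */
public class DeadNodeDecommissionSketch {

    /** Admin states named in the issue description. */
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /**
     * Returns the next admin state of a dead DN that is being decommissioned.
     * The two arguments stand in for pendingReplicationBlocksCount and
     * underReplicatedBlocksCount (assumed to be supplied by the caller).
     */
    static AdminState nextState(AdminState current,
                                long pendingReplicationBlocks,
                                long underReplicatedBlocks) {
        Objects.requireNonNull(current, "current");
        switch (current) {
            case NORMAL:
                // Step 1: never jump straight to DECOMMISSIONED.
                return AdminState.DECOMMISSION_INPROGRESS;
            case DECOMMISSION_INPROGRESS:
                // Step 2: both counters must be zero before moving on.
                boolean fullyReplicated =
                        pendingReplicationBlocks == 0 && underReplicatedBlocks == 0;
                // Step 3: finish only when the check passes; otherwise wait.
                return fullyReplicated
                        ? AdminState.DECOMMISSIONED
                        : AdminState.DECOMMISSION_INPROGRESS;
            default:
                // DECOMMISSIONED is terminal in this sketch.
                return current;
        }
    }

    public static void main(String[] args) {
        // A dead NORMAL DN with one block still under-replicated.
        AdminState state = nextState(AdminState.NORMAL, 0, 1);
        System.out.println("after start:         " + state);
        state = nextState(state, 0, 1);
        System.out.println("blocks outstanding:  " + state);
        state = nextState(state, 0, 0);
        System.out.println("all blocks healthy:  " + state);
    }
}
```

In the 3-dead-DN example from the description, the counters would never reach
zero, so the nodes would stay in DECOMMISSION_INPROGRESS until the
administrator intervenes.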



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
