[jira] [Resolved] (HDFS-17231) HA: Safemode should exit when resources are from low to available

Xiaoqiao He (Jira) Tue, 24 Oct 2023 20:45:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiaoqiao He resolved HDFS-17231.
--------------------------------
    Fix Version/s: 3.4.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

> HA: Safemode should exit when resources are from low to available
> -----------------------------------------------------------------
>
>                 Key: HDFS-17231
>                 URL: https://issues.apache.org/jira/browse/HDFS-17231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.3.4, 3.3.6
>            Reporter: kuper
>            Assignee: kuper
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>         Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically puts the NameNode into safe mode
> when it detects that resources are insufficient. When zkfc detects
> insufficient resources, it triggers failover. Consider the following scenario:
>  * Initially, nn01 is active and nn02 is standby. Because resources in
> dfs.namenode.name.dir are insufficient, the NameNodeResourceMonitor detects
> the problem and puts nn01 into safe mode. zkfc then triggers failover.
>  * At this point, nn01 is in safe mode (ON) and standby, while nn02 is in
> safe mode (OFF) and active.
>  * After a period of time, the resources in nn01's dfs.namenode.name.dir
> recover, which causes a brief instability and triggers failover again.
>  * Now nn01 is in safe mode (ON) and active, while nn02 is in safe mode
> (OFF) and standby.
>  * Because nn01 is active but still in safe mode (ON), HDFS can be neither
> read from nor written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
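
The scenario above can be replayed with a small toy model. This is only a
sketch: ToyNameNode, monitorTick, failover and the exitOnRecovery flag are
hypothetical names invented for illustration, not Hadoop classes and not the
committed patch. It assumes the monitor only runs while a node is active (the
reproduction notes in this issue say that switching to standby stops the
NameNodeResourceMonitor thread).

```java
public class SafeModeScenario {

    /** Toy stand-in for a NameNode; hypothetical, not the real Hadoop class. */
    static final class ToyNameNode {
        final String name;
        boolean active;
        boolean safeMode;
        boolean resourcesAvailable = true;

        ToyNameNode(String name, boolean active) {
            this.name = name;
            this.active = active;
        }

        /**
         * One pass of a toy resource monitor. Assumption: it does nothing on a
         * standby node. exitOnRecovery switches on the behaviour the issue title
         * asks for (leave safe mode once resources are available again).
         */
        void monitorTick(boolean exitOnRecovery) {
            if (!active) {
                return;
            }
            if (!resourcesAvailable) {
                safeMode = true;
            } else if (exitOnRecovery && safeMode) {
                safeMode = false;
            }
        }

        /** A node can serve reads and writes only if active and out of safe mode. */
        boolean canServeClients() {
            return active && !safeMode;
        }
    }

    /** Toy failover: swap the active role between the two nodes. */
    static void failover(ToyNameNode from, ToyNameNode to) {
        from.active = false;
        to.active = true;
    }

    /** Replays the steps from the issue; returns whether nn01 can serve clients at the end. */
    static boolean replay(boolean exitOnRecovery) {
        ToyNameNode nn01 = new ToyNameNode("nn01", true);
        ToyNameNode nn02 = new ToyNameNode("nn02", false);

        nn01.resourcesAvailable = false;   // disk under dfs.namenode.name.dir runs low
        nn01.monitorTick(exitOnRecovery);  // nn01 enters safe mode
        failover(nn01, nn02);              // zkfc fails over: nn02 becomes active

        nn01.resourcesAvailable = true;    // resources recover while nn01 is standby
        nn01.monitorTick(exitOnRecovery);  // no effect: monitor is idle on standby
        failover(nn02, nn01);              // second failover: nn01 is active again
        nn01.monitorTick(exitOnRecovery);  // monitor runs again on the active node

        return nn01.canServeClients();
    }

    public static void main(String[] args) {
        System.out.println("without exit on recovery: " + replay(false));
        System.out.println("with exit on recovery:    " + replay(true));
    }
}
```

In this toy model, replay(false) leaves nn01 active but stuck in safe mode,
which matches the last bullet above; replay(true) lets nn01 serve clients
again once resources are back.
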
> *Reproduction*
>  # Increase dfs.namenode.resource.du.reserved.
>  # Increase ha.health-monitor.check-interval.ms so that nn01 does not
> switch to standby (which stops the NameNodeResourceMonitor thread) right
> away. The NameNodeResourceMonitor must put nn01 into safe mode before the
> switch to standby happens.
>  # On the active node nn01, use the dd command to create a file large
> enough to push free space below the threshold, triggering a low available
> disk space condition.
>  # If the nn01 namenode process is still alive, nn01 ends up in safe mode
> (ON) and standby.
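
When following the reproduction steps, it can help to see how far a volume
is from the reserved threshold before running dd. The sketch below is not
Hadoop's own check; ReservedSpaceCheck and isLow are hypothetical names, the
"usable space below reserved bytes" comparison is an assumption about what
"low on available disk space" means, and the 100 MiB fallback is only an
illustrative value to replace with whatever dfs.namenode.resource.du.reserved
is set to.

```java
import java.io.File;

public class ReservedSpaceCheck {

    /** Assumed rule for this sketch: the volume is "low" when usable bytes drop below the reserve. */
    static boolean isLow(long usableBytes, long reservedBytes) {
        return usableBytes < reservedBytes;
    }

    public static void main(String[] args) {
        // First argument: a directory on the volume to inspect (e.g. a dfs.namenode.name.dir path).
        File dir = new File(args.length > 0 ? args[0] : ".");
        // Second argument: reserved bytes; the fallback of 100 MiB is illustrative only.
        long reserved = args.length > 1 ? Long.parseLong(args[1]) : 100L * 1024 * 1024;

        long usable = dir.getUsableSpace(); // bytes available to this JVM on that volume
        System.out.printf("dir=%s usable=%d reserved=%d low=%b%n",
                dir.getAbsolutePath(), usable, reserved, isLow(usable, reserved));
    }
}
```

Run it with the name directory and the configured reserve as arguments; once
the dd file from step 3 is in place, low=true indicates the volume has
crossed the threshold under the assumption above.
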



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

