gp1314 opened a new pull request, #6207:
URL: https://github.com/apache/hadoop/pull/6207
<!--
Thanks for sending a pull request!
1. If this is your first time, please read our contributor guidelines:
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
2. Make sure your PR title starts with JIRA issue id, e.g.,
'HADOOP-17799. Your PR title ...'.
-->
### Description of PR
The NameNodeResourceMonitor automatically puts the NameNode into safe mode when it
detects that resources are insufficient. When ZKFC detects insufficient resources, it
triggers a failover. Consider the following scenario:
Initially, nn01 is active and nn02 is standby. Because the resources under
dfs.namenode.name.dir run low, the NameNodeResourceMonitor detects the shortage and
puts nn01 into safe mode; ZKFC then triggers a failover.
- At this point, nn01 is in safe mode (ON) and standby, while nn02 is in
safe mode (OFF) and active.
- After some time, the resources under nn01's dfs.namenode.name.dir recover,
causing a brief instability that triggers another failover.
- Now nn01 is in safe mode (ON) and active, while nn02 is in safe mode
(OFF) and standby.
- However, since nn01 is active but still in safe mode (ON), HDFS cannot be read
from or written to, as the sketch below illustrates.
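
To make the end state concrete, here is a minimal client-side sketch of what that last
bullet means. It is hypothetical: the HA nameservice URI `hdfs://mycluster` and the
class name are placeholders, not part of this PR.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SafeModeActiveCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://mycluster" is a placeholder HA nameservice URI.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(URI.create("hdfs://mycluster"), conf);
    if (dfs.isInSafeMode()) {
      // The RPC reaches the current active NameNode (nn01 in the scenario above),
      // but safe mode is still ON, so create/delete/rename requests are rejected
      // with SafeModeException until an operator intervenes.
      System.out.println("Active NameNode is in safe mode; writes will be rejected.");
    }
  }
}
```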
**Reproduction**
1. Increase dfs.namenode.resource.du.reserved.
2. Increase ha.health-monitor.check-interval.ms so that the NameNode does not
immediately switch to standby and stop the NameNodeResourceMonitor thread; instead,
the NameNodeResourceMonitor has time to enter safe mode before the switch to standby
(see the configuration sketch after this list).
3. On the active nn01 node, use the dd command to create a file large enough to
exceed the threshold, triggering the low-on-available-disk-space condition.
4. If the nn01 NameNode process does not die, nn01 ends up in safe mode (ON) and
standby.
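
A rough sketch of steps 1 and 2 as configuration, assuming a test-style `Configuration`
object; the values are illustrative only and not taken from this PR. Step 3 (filling
dfs.namenode.name.dir with dd) happens on the nn01 host outside this snippet.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ReproConfigSketch {
  static Configuration buildConf() {
    Configuration conf = new HdfsConfiguration();
    // Step 1: raise the reserved space so the NameNodeResourceMonitor trips
    // while plenty of real disk space is still available (10 GB is illustrative).
    conf.setLong("dfs.namenode.resource.du.reserved", 10L * 1024 * 1024 * 1024);
    // Step 2: stretch the ZKFC health-check interval so the resource monitor
    // can put nn01 into safe mode before ZKFC switches it to standby (30 s is illustrative).
    conf.setLong("ha.health-monitor.check-interval.ms", 30_000L);
    return conf;
  }
}
```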
### How was this patch tested?
Unit test.
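
For reference, a hypothetical MiniDFSCluster outline of how the failover-while-in-safe-mode
scenario could be driven from a test; this is only a sketch of the setup, not necessarily
the unit test added in this PR, and the final assertion is left as a comment because the
exact invariant depends on the fix.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;
import org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter;

public class SafeModeFailoverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(0)
        .build();
    try {
      cluster.waitActive();
      cluster.transitionToActive(0);                    // nn01 active, nn02 standby
      // Simulate the NameNodeResourceMonitor: nn01 enters safe mode with resourcesLow=true.
      NameNodeAdapter.enterSafeMode(cluster.getNameNode(0), true);
      cluster.transitionToStandby(0);                   // first failover
      cluster.transitionToActive(1);
      cluster.transitionToStandby(1);                   // second failover, back to nn01
      cluster.transitionToActive(0);
      // The bug: nn01 is now active while safe mode is still ON.
      // A test for the fix would assert here that this combination cannot persist.
      System.out.println("nn01 in safe mode: " + cluster.getNamesystem(0).isInSafeMode());
    } finally {
      cluster.shutdown();
    }
  }
}
```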
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?