gp1314 opened a new pull request, #6207:
URL: https://github.com/apache/hadoop/pull/6207
<!--
Thanks for sending a pull request!
1. If this is your first time, please read our contributor guidelines:
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
2. Make sure your PR title starts with JIRA issue id, e.g.,
'HADOOP-17799. Your PR title ...'.
-->
### Description of PR
The NameNodeResourceMonitor automatically puts the NameNode into safe mode when it
detects that resources are insufficient. When ZKFC detects insufficient resources, it
triggers a failover. Consider the following scenario:
Initially, nn01 is active and nn02 is standby. Because the resources under
dfs.namenode.name.dir run low, the NameNodeResourceMonitor detects the shortage and
puts nn01 into safe mode; ZKFC then triggers a failover.
- At this point, nn01 is in safe mode (ON) and standby, while nn02 is in
safe mode (OFF) and active.
- After some time, the resources under nn01's dfs.namenode.name.dir recover,
causing a brief instability that triggers another failover.
- Now nn01 is in safe mode (ON) and active, while nn02 is in safe mode
(OFF) and standby.
- However, since nn01 is active but still in safe mode (ON), HDFS cannot be read
from or written to, as the sketch below illustrates.
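
To make the end state concrete, here is a minimal client-side sketch of what that last
bullet means. It is hypothetical: the HA nameservice URI `hdfs://mycluster` and the
class name are placeholders, not part of this PR.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SafeModeActiveCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://mycluster" is a placeholder HA nameservice URI.
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(URI.create("hdfs://mycluster"), conf);
    if (dfs.isInSafeMode()) {
      // The RPC reaches the current active NameNode (nn01 in the scenario above),
      // but safe mode is still ON, so create/delete/rename requests are rejected
      // with SafeModeException until an operator intervenes.
      System.out.println("Active NameNode is in safe mode; writes will be rejected.");
    }
  }
}
```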
**Reproduction**
1. Increase dfs.namenode.resource.du.reserved.
2. Increase ha.health-monitor.check-interval.ms so that the NameNode does not
immediately switch to standby and stop the NameNodeResourceMonitor thread; instead,
the NameNodeResourceMonitor has time to enter safe mode before the switch to standby
(see the configuration sketch after this list).
3. On the active nn01 node, use the dd command to create a file large enough to
exceed the threshold, triggering the low-on-available-disk-space condition.
4. If the nn01 NameNode process does not die, nn01 ends up in safe mode (ON) and
standby.
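
A rough sketch of steps 1 and 2 as configuration, assuming a test-style `Configuration`
object; the values are illustrative only and not taken from this PR. Step 3 (filling
dfs.namenode.name.dir with dd) happens on the nn01 host outside this snippet.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ReproConfigSketch {
  static Configuration buildConf() {
    Configuration conf = new HdfsConfiguration();
    // Step 1: raise the reserved space so the NameNodeResourceMonitor trips
    // while plenty of real disk space is still available (10 GB is illustrative).
    conf.setLong("dfs.namenode.resource.du.reserved", 10L * 1024 * 1024 * 1024);
    // Step 2: stretch the ZKFC health-check interval so the resource monitor
    // can put nn01 into safe mode before ZKFC switches it to standby (30 s is illustrative).
    conf.setLong("ha.health-monitor.check-interval.ms", 30_000L);
    return conf;
  }
}
```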
### How was this patch tested?
Unit test.
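
For reference, a hypothetical MiniDFSCluster outline of how the failover-while-in-safe-mode
scenario could be driven from a test; this is only a sketch of the setup, not necessarily
the unit test added in this PR, and the final assertion is left as a comment because the
exact invariant depends on the fix.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;
import org.apache.hadoop.hdfs.server.namenode.NameNodeAdapter;

public class SafeModeFailoverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(0)
        .build();
    try {
      cluster.waitActive();
      cluster.transitionToActive(0);                    // nn01 active, nn02 standby
      // Simulate the NameNodeResourceMonitor: nn01 enters safe mode with resourcesLow=true.
      NameNodeAdapter.enterSafeMode(cluster.getNameNode(0), true);
      cluster.transitionToStandby(0);                   // first failover
      cluster.transitionToActive(1);
      cluster.transitionToStandby(1);                   // second failover, back to nn01
      cluster.transitionToActive(0);
      // The bug: nn01 is now active while safe mode is still ON.
      // A test for the fix would assert here that this combination cannot persist.
      System.out.println("nn01 in safe mode: " + cluster.getNamesystem(0).isInSafeMode());
    } finally {
      cluster.shutdown();
    }
  }
}
```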
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?