[
https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HADOOP-8247:
--------------------------------
Attachment: hadoop-8247.txt
Here's a preliminary patch for this issue. It still needs a little
cleanup/javadoc/etc, but I wanted to make sure people agree this is the right
direction before I finish it up.
Here's a summary of the change:
- Add a new flag, dfs.ha.automatic-failover.enabled, which can be set
per-nameservice or globally
- Add a new RequestInfo structure as a parameter to all the HAServiceProtocol
methods. It currently has just one field, which indicates what type of client
the request is on behalf of: a user (manual CLI failover), a ZKFC (automatic
failover), or USER_FORCE -- indicating a user who wants to bypass this safety
check.
- In the NN, if auto-failover is enabled, disallow HA requests from users. If
it is not enabled, disallow HA requests from ZKFCs.
- In the ZKFC, disallow startup if auto-failover is disabled
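For reference, enabling the flag globally would look something like the
following in hdfs-site.xml. The property name is taken from the summary above;
the exact key format of the per-nameservice variant is an assumption here, not
part of this patch's description:

```xml
<!-- Global form of the flag named above. A per-nameservice variant
     (e.g. with a nameservice-id suffix) is described in the summary,
     but its exact key format is an assumption in this sketch. -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```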
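The gating described in the last three bullets can be sketched roughly as
below. All names here (RequestSource, RequestInfo fields, checkRequest) are
illustrative stand-ins for whatever the patch actually defines, and the
exception messages are paraphrased from the logs later in this comment:

```java
// Hypothetical sketch of the request-source gating described above.
// Class, enum, and method names are illustrative, not the patch's
// actual identifiers.
public class HaGatingSketch {

    // Who is making the HA request.
    enum RequestSource { REQUEST_BY_USER, REQUEST_BY_USER_FORCED, REQUEST_BY_ZKFC }

    // Minimal stand-in for the RequestInfo structure: one field,
    // indicating what type of client the request is on behalf of.
    static class RequestInfo {
        final RequestSource source;
        RequestInfo(RequestSource source) { this.source = source; }
    }

    /**
     * Throws if the request source is not allowed given the
     * auto-failover configuration. The ZKFC-side startup check is the
     * same idea: refuse REQUEST_BY_ZKFC when auto-failover is disabled.
     */
    static void checkRequest(boolean autoFailoverEnabled, RequestInfo req) {
        switch (req.source) {
            case REQUEST_BY_USER:
                if (autoFailoverEnabled) {
                    throw new IllegalStateException(
                        "Manual HA control for this NameNode is disallowed, "
                        + "because automatic HA is enabled.");
                }
                break;
            case REQUEST_BY_ZKFC:
                if (!autoFailoverEnabled) {
                    throw new IllegalStateException(
                        "Automatic failover is not enabled for this NameNode.");
                }
                break;
            case REQUEST_BY_USER_FORCED:
                // Always allowed: the user explicitly bypasses the safety check.
                break;
        }
    }

    public static void main(String[] args) {
        // Manual failover is fine when auto-failover is off.
        checkRequest(false, new RequestInfo(RequestSource.REQUEST_BY_USER));
        // Forced requests bypass the check in either mode.
        checkRequest(true, new RequestInfo(RequestSource.REQUEST_BY_USER_FORCED));
        System.out.println("gating sketch ok");
    }
}
```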
In addition to the unit tests, I ran the following manual tests, on a secure
cluster.
1) Did not enable the auto-failover config
2) Ran failovers using the haadmin command successfully
3) Tried to run bin/hdfs zkfc, got expected error:
{code}
12/04/05 20:53:38 INFO tools.DFSZKFailoverController: Failover controller
configured for NameNode nameserviceId1.nn1
12/04/05 20:53:38 FATAL ha.ZKFailoverController: Automatic failover is not
enabled for NameNode at todd-w510/127.0.0.1:8021. Please ensure that automatic
failover is enabled in the configuration before running the ZK failover
controller.
{code}
4) Enabled auto-failover in my config, but left the NNs running. Got an error
when the ZKFC tried to make the local node active. TODO for a future JIRA: the
ZKFC could abort at this point, when it sees an AccessControlException, since
that is indicative of misconfiguration.
5) Restarted NNs, so they picked up the new config.
6) Ran the ZKFC; it successfully made one of the NNs active. Verified automatic
failover behavior by killing one of the NNs.
7) Ran manual failover command, got expected error:
{code}
12/04/05 20:58:31 ERROR ha.FailoverController: Unable to get service state for
NameNode at todd-w510/127.0.0.1:8022: Manual HA control for this NameNode is
disallowed, because automatic HA is enabled.
{code}
----
Open questions: should we allow the non-mutative commands like
{{monitorHealth}} and {{getServiceState}} to run when auto-failover is
configured? My thinking is that we probably should. If so, should we keep the
RequestInfo parameter on those calls, or only include RequestInfo for the
calls that trigger transitions?
> Auto-HA: add a config to enable auto-HA, which disables manual FC
> -----------------------------------------------------------------
>
> Key: HADOOP-8247
> URL: https://issues.apache.org/jira/browse/HADOOP-8247
> Project: Hadoop Common
> Issue Type: Improvement
> Components: auto-failover, ha
> Affects Versions: Auto Failover (HDFS-3042)
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hadoop-8247.txt
>
>
> Currently, if automatic failover is set up and running, and the user uses the
> "haadmin -failover" command, he or she can end up putting the system in an
> inconsistent state, where the state in ZK disagrees with the actual state of
> the world. To fix this, we should add a config flag which is used to enable
> auto-HA. When this flag is set, we should disallow use of the haadmin command
> to initiate failovers. We should refuse to run ZKFCs when the flag is not
> set. Of course, this flag should be scoped by nameservice.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira