Paul Rubio created HADOOP-10684:
-----------------------------------
Summary: Extend HA support for more use cases
Key: HADOOP-10684
URL: https://issues.apache.org/jira/browse/HADOOP-10684
Project: Hadoop Common
Issue Type: Improvement
Components: ha
Reporter: Paul Rubio
Priority: Minor
We'd like the current HA framework to be more configurable from a behavior
standpoint. In particular:
- Add the ability for a HAServiceTarget to survive a configurable number of
health check failures (default of 0) before HealthMonitor (HM) reports service
not responding or service unhealthy. For instance, predicate the HM on a state
machine whose default implementation can be overridden by method or constructor
argument. The default would behave the same as today.
-- If a target fails a health check but does not exceed the maximum number of
consecutive check failures, it’d be desirable if the target and/or controller
were alerted.
--- e.g. introduce a SERVICE_DYING state
-- Additionally, it’d be desirable if a mechanism existed, similar to fencing
semantics, for “reviving” a service that transitioned to SERVICE_DYING.
--- e.g. attemptRevive(…)
- Add the ability to allow a service to completely fail (no failover or
failback possible). There are scenarios where allowing a failover or failback
could cause more damage.
-- E.g. a recovered master with stale data. The master may have been manually
recovered (human error).
- Add affinity to a particular HAServiceTarget.
-- In other words, allow the controller to prefer one target over another when
deciding leadership.
-- If a higher-affinity but previously unhealthy target becomes healthy, then
it should be allowed to become the leader.
-- Likewise, if two targets are racing for a ZooKeeper lock, then the
controller should "prefer" the higher-affinity target.
-- It might make more sense to add a different implementation/subclass of the
ZKFailoverController (e.g. ZKAffinityFailoverController) than to modify the
current behavior.
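
To make the first item concrete, here is a rough, standalone sketch of the
failure-tolerance state machine. It is not based on the existing HealthMonitor
code: the class name FailureToleranceTracker, its State enum, and the record()
method are hypothetical names made up for illustration, and SERVICE_DYING is
the state proposed above. The assumption is that a target stays in
SERVICE_DYING while its consecutive failures do not exceed the configured
maximum, and that any passing check resets the count.

```java
/**
 * Hypothetical sketch only: counts consecutive health check failures and
 * maps them to a coarse state. Not part of Hadoop's HA framework.
 */
final class FailureToleranceTracker {

    /** Illustrative states; SERVICE_DYING is the state proposed in this issue. */
    enum State { SERVICE_HEALTHY, SERVICE_DYING, SERVICE_UNHEALTHY }

    private final int maxToleratedFailures;
    private int consecutiveFailures = 0;

    /** Default of 0 tolerated failures: the first failure is reported at once. */
    FailureToleranceTracker() {
        this(0);
    }

    FailureToleranceTracker(int maxToleratedFailures) {
        if (maxToleratedFailures < 0) {
            throw new IllegalArgumentException("maxToleratedFailures must be >= 0");
        }
        this.maxToleratedFailures = maxToleratedFailures;
    }

    /** Records one health check result and returns the resulting state. */
    State record(boolean checkPassed) {
        if (checkPassed) {
            // Any passing check resets the consecutive-failure count.
            consecutiveFailures = 0;
            return State.SERVICE_HEALTHY;
        }
        consecutiveFailures++;
        // Within the tolerated budget the target is only "dying";
        // once the budget is exceeded it is reported as unhealthy.
        return consecutiveFailures > maxToleratedFailures
                ? State.SERVICE_UNHEALTHY
                : State.SERVICE_DYING;
    }
}
```

With the no-argument constructor the first failed check returns
SERVICE_UNHEALTHY, which matches the "default would behave the same as today"
requirement. With new FailureToleranceTracker(2), two consecutive failures
return SERVICE_DYING and the third returns SERVICE_UNHEALTHY. A revive hook
such as attemptRevive(…) could be called whenever SERVICE_DYING is returned.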
Please comment with thoughts/ideas/etc...
Thanks.