Hi,

We are in the process of merging the core building blocks of the topology
health manager (HM), which is based on Dhalion. This integration is still
experimental and needs thorough testing, so HM should be activated on demand
and remain disabled by default. Accordingly, we are proposing the following
scheme for launching the HM process.

The scheme should satisfy the following constraints:

   1. Launch HM on container-0, colocated with the scheduler and the metrics
   cache.
   2. Initially, HM will be disabled by default. This means the HM process
   should not be started at all, to avoid any side effects. Once HM is well
   tested, a system-wide configuration would enable HM for all topologies
   submitted afterwards.
   3. If a topology explicitly opts in through its configuration, HM will be
   started and will take actions as per its configuration, i.e. healthmgr.yaml.
   4. Like other Heron processes, the executor should manage HM's life cycle.

Accordingly we propose the following.

   1. Add a new Config API to enable self-healing per topology:
   Config.enableHealthManager(Topology.HealthManagerMode mode). The default
   value will be "system", indicating that the system-wide configuration
   should be used.
   2. Add a new config to heron_internals.yaml:
   "heron.healthmgr.default.mode". Its value will be "disabled".
   3. The scheduler will read the default value of the HM mode from the
   heron_internals config file, as is done in SchedulerMain.setupLogging [3].
   It will provide either the user-configured mode value or the default
   mode value to the executor as a command line argument.
   4. Add the HM mode to the command line arguments of heron_executor.py. This
   is similar to the executor command line arguments for checkpointing [2].
   5. The executor will launch HM if the mode is not "disabled".
   6. Later, if the default HM mode value is set to "dryrun" or
   "self-healing", HM will be launched for all newly submitted topologies.
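
To make the mode handling in steps 3 to 5 concrete, here is a minimal Python
sketch. It is only an illustration under assumptions: the function names and
the --health-manager-mode flag are hypothetical, and the real argument
handling in heron_executor.py may look different. Only the mode names
("system", "disabled", "dryrun", "self-healing") come from the proposal.

```python
import argparse

# Mode names from the proposal; "system" defers to the system-wide default.
SYSTEM = "system"
DISABLED = "disabled"
VALID_MODES = {DISABLED, "dryrun", "self-healing"}


def resolve_mode(topology_mode, system_default):
    """Scheduler side (hypothetical helper): pick the mode sent to the executor.

    topology_mode is the per-topology setting; None or "system" means the
    topology did not opt in explicitly, so the system default applies.
    """
    mode = system_default if topology_mode in (None, SYSTEM) else topology_mode
    if mode not in VALID_MODES:
        raise ValueError("unknown health manager mode: %r" % mode)
    return mode


def should_launch_healthmgr(argv):
    """Executor side (hypothetical helper): decide whether to start HM.

    The flag name is an assumption made for this sketch. A missing flag is
    treated as "disabled" so that HM never starts by accident.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--health-manager-mode", default=DISABLED)
    args, _ = parser.parse_known_args(argv)
    return args.health_manager_mode != DISABLED
```

With the default set to "disabled", a topology that leaves the mode at
"system" never gets an HM process, while a topology that explicitly asks for
"dryrun" or "self-healing" does.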


What do you think about this approach?

Thanks,
Ashvin


[1] https://github.com/twitter/heron/pull/2132
[2] https://github.com/twitter/heron/blob/master/heron/executor/src/python/heron_executor.py#L58
[3] https://github.com/twitter/heron/blob/master/heron/scheduler-core/src/java/com/twitter/heron/scheduler/SchedulerMain.java#L277
