Hi,

We are in the process of merging the core building blocks of the topology health manager (HM), which is based on Dhalion. This integration is still experimental and needs thorough testing, so the HM should be activated on demand and remain disabled by default. Accordingly, we are proposing the following scheme for launching the HM process.
We would like to satisfy the following constraints:

1. The HM is launched on container-0, colocated with the scheduler and the metrics cache.
2. Initially, the HM is disabled by default. This means the HM process should not be started at all, to avoid any side effects. Once the HM is well tested, a system-wide configuration would enable it for all topologies submitted afterwards.
3. If a topology explicitly opts in through its configuration, the HM will be started and will take actions as per its configuration, i.e. healthmgr.yaml.
4. Like other Heron processes, the executor should manage the HM's life cycle.

Accordingly, we propose the following:

1. Add a new Config API to enable self-healing per topology: Config.enableHealthManager(Topology.HealthManagerMode mode). The default value will be "system", indicating that the system-wide configuration should be used.
2. Add a new config to heron_internals.yaml: "heron.healthmgr.default.mode". Its value will be "disabled".
3. The scheduler will read the default HM mode from the heron_internals config file, as is done in SchedulerMain.setupLogging [3]. It will provide either the user-configured mode or the default mode to the executor as a command line argument.
4. Add the HM mode to the command line arguments of heron_executor.py. This is similar to the executor command line arguments for checkpointing [2].
5. The executor will launch the HM if the mode is not "disabled".
6. Later, if the default HM mode is set to "dryrun" or "self-healing", the HM will be launched for all newly submitted topologies.

What do you think about this approach?

Thanks,
Ashvin

[1] https://github.com/twitter/heron/pull/2132
[2] https://github.com/twitter/heron/blob/master/heron/executor/src/python/heron_executor.py#L58
[3] https://github.com/twitter/heron/blob/master/heron/scheduler-core/src/java/com/twitter/heron/scheduler/SchedulerMain.java#L277
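
P.S. To make the mode resolution in steps 3 to 5 of the proposal a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions: resolve_hm_mode and should_launch_hm are hypothetical helper names, the internals config is modeled as a plain dict, and none of this is existing Heron code. Only the mode strings and the config key come from the proposal itself.

```python
# Illustrative sketch only. resolve_hm_mode and should_launch_hm are
# hypothetical helpers, not part of Heron; the internals config is
# modeled as a plain dict for simplicity.

SYSTEM = "system"
DISABLED = "disabled"
DEFAULT_MODE_KEY = "heron.healthmgr.default.mode"
VALID_MODES = {"disabled", "dryrun", "self-healing"}


def resolve_hm_mode(topology_mode, internals_config):
    """Scheduler side: choose the mode that is passed to the executor.

    A topology mode of "system" (or no mode at all) falls back to the
    system-wide default; anything else is an explicit per-topology choice.
    """
    default_mode = internals_config.get(DEFAULT_MODE_KEY, DISABLED)
    if topology_mode in (None, SYSTEM):
        mode = default_mode
    else:
        mode = topology_mode
    if mode not in VALID_MODES:
        raise ValueError("unknown health manager mode: %r" % mode)
    return mode


def should_launch_hm(mode):
    """Executor side: start the HM process unless the mode is "disabled"."""
    return mode != DISABLED
```

With this shape, a topology that never touches the new API resolves to whatever the system-wide default is, so flipping the default later (step 6) only affects topologies that did not choose a mode explicitly.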