> Only one can be registered with Mesos at a time, so we'd want to ensure that the MyriadScheduler only registers with Mesos if it is the active RM.
This is already the behavior in Myriad. Only the active RM loads a YARN
scheduler (FairScheduler, CapacityScheduler, etc.). Since Myriad plugs into
the RM by extending a YARN scheduler, only the active RM will initialize
Myriad. Hence Myriad registers with Mesos only when there is an RM in the
"active" state.

Thanks,
Santosh

On Wed, Apr 1, 2015 at 12:47 PM, Adam Bordelon <[email protected]> wrote:

> I know one of the things Mohit was working on was scheduler HA and task
> reconciliation, so that if the RM dies, then Marathon (or another PaaS)
> could restart it elsewhere, recover its state, and reconnect to running
> tasks. In this scenario, we definitely want to keep the executors/NMs
> running when the RM/scheduler exits.
> See https://github.com/mesos/myriad/issues/13 and
> https://github.com/mesos/myriad/issues/16
> Mohit, what's the status of this work? Do you have a branch you can share
> that others can continue on, if you don't have time to complete it
> yourself?
>
> RM HA via multiple RMs is a separate consideration:
> http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_rm_ha_config.html
> There are a few considerations with having multiple RMs (active+standby)
> hence multiple schedulers. Only one can be registered with Mesos at a time,
> so we'd want to ensure that the MyriadScheduler only registers with Mesos
> if it is the active RM. We'd also want to store/retrieve scheduler state
> (in ZK/wherever) when failing over to another RM/Scheduler, but we'll do
> that for general single-RM HA anyway.
>
> On Wed, Apr 1, 2015 at 5:33 AM, Paul Read <[email protected]> wrote:
>
> > If you don't mind adding a connection to zookeeper, storing the tasking
> > status by host and instance in zookeeper, cleaning up on a graceful RM
> > die, then you should be able to recover at virtually any point. And have
> > multiple RMs if that is a goal.
> >
> > Not sure at this point if the Executor would need to connect to zookeeper
> > or just the scheduler. At first glance I would think just the Scheduler
> > however if the RM accidentally dies and then the Executor is killed it
> > may be reasonable to have it update ZK with status...or just have any RM
> > when it comes up to re-sync by requesting a sync msg and if it does not
> > get one in a reasonable amount of time assume it's dead...could go so far
> > as to track PIDs and test to see if they are out there as well.
> >
> > Just a few thoughts.
> >
> > On Wed, Apr 1, 2015 at 5:31 AM, Paul Read <[email protected]> wrote:
> >
> > > Is it reasonable to expect the Executor and NM to exit if the
> > > RM/Scheduler accidentally dies or is killed? Or should a restart of the
> > > RM/Scheduler re-sync with the running Executor/NM?
> > >
> > > I know there is currently no mechanism to do that but I was looking at
> > > issue #55 and part of the problem/solution would be eliminated if the
> > > child tasks were to terminate if the RM dies.
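
For anyone following along, here is a rough sketch of the active-only registration described at the top of this reply. It is a toy model in Python, not Myriad or YARN code: `ResourceManager`, `FakeMesosDriver`, `transition_to_active` and `transition_to_standby` are hypothetical names made up for illustration, and the `failover` flag is only assumed to stand for "leave the executors/NMs running", as discussed in the quoted thread.

```python
from enum import Enum


class HAState(Enum):
    STANDBY = "standby"
    ACTIVE = "active"


class FakeMesosDriver:
    """Hypothetical stand-in for a Mesos scheduler driver; it only records calls."""

    def __init__(self):
        self.registered = False
        self.calls = []

    def start(self):
        # In this toy model, starting the driver is what "registers" with Mesos.
        self.registered = True
        self.calls.append("start")

    def stop(self, failover):
        # failover=True is assumed to mean "keep executors/NMs running".
        self.registered = False
        self.calls.append(f"stop(failover={failover})")


class ResourceManager:
    """Toy RM: the scheduler (and its driver) exists only while the RM is active."""

    def __init__(self, name, driver_factory):
        self.name = name
        self.state = HAState.STANDBY
        self._driver_factory = driver_factory
        self.driver = None  # a standby RM has no scheduler, so nothing registers

    def transition_to_active(self):
        if self.state is HAState.ACTIVE:
            return
        self.state = HAState.ACTIVE
        # The scheduler plugin is initialized here, and only here.
        self.driver = self._driver_factory()
        self.driver.start()

    def transition_to_standby(self):
        if self.state is HAState.STANDBY:
            return
        self.state = HAState.STANDBY
        self.driver.stop(failover=True)
        self.driver = None


if __name__ == "__main__":
    rm1 = ResourceManager("rm1", FakeMesosDriver)
    rm2 = ResourceManager("rm2", FakeMesosDriver)
    rm1.transition_to_active()
    for rm in (rm1, rm2):
        registered = rm.driver is not None and rm.driver.registered
        print(rm.name, rm.state.value, "registered" if registered else "not registered")
```

The point of the sketch is that registration is a side effect of scheduler initialization, so a standby RM never holds a driver and cannot register by accident. Storing and recovering scheduler state (in ZK or elsewhere) is deliberately left out.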
