I know one of the things Mohit was working on was scheduler HA and task
reconciliation, so that if the RM dies, then Marathon (or another PaaS)
could restart it elsewhere, recover its state, and reconnect to running
tasks. In this scenario, we definitely want to keep the executors/NMs
running when the RM/scheduler exits.
See https://github.com/mesos/myriad/issues/13 and
https://github.com/mesos/myriad/issues/16
Mohit, what's the status of this work? Do you have a branch you can share
that others can continue on, if you don't have time to complete it yourself?

RM HA via multiple RMs is a separate consideration:
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_rm_ha_config.html
There are a few considerations with having multiple RMs (active+standby)
hence multiple schedulers. Only one can be registered with Mesos at a time,
so we'd want to ensure that the MyriadScheduler only registers with Mesos
if it is the active RM. We'd also want to store/retrieve scheduler state
(in ZK/wherever) when failing over to another RM/Scheduler, but we'll do
that for general single-RM HA anyway.

On Wed, Apr 1, 2015 at 5:33 AM, Paul Read <[email protected]> wrote:

> If you don't mind adding a connection to zookeeper, storing the tasking
> status by host and instance in zookeeper, cleaning up on a graceful RM die,
> then you should be able to recover at virtually any point. And have
> multiple RMs if that is a goal.
>
> Not sure at this point if the Executor would need to connect to zookeeper
> or just the scheduler. At first glance I would think just the Scheduler
> however if the RM accidentally dies and then the Executor is killed it may
> be reasonable to have it update ZK with status...or just have any RM when
> it comes up to re-sync by requesting a sync msg and if it does not get one
> in a reasonable amount of time assume its dead...could go so far as to
> track PIDs and test to see if they are out there as well.
>
> Just a few thoughts.
>
> On Wed, Apr 1, 2015 at 5:31 AM, Paul Read <[email protected]> wrote:
>
> >
> > Is it reasonable to expect the Executor and NM to exit if the the
> > RM/Scheduler accidently dies or is killed? Or should a restart of the
> > RM/Scheduler re-sync with the running Executor/NM ?
> >
> > I know there is currently no mechanism to do that but I was looking at
> > issue #55 and part of the problem/solution would be eliminated if the
> child
> > tasks were to terminate if the RM dies.
> >
> >
> >
>

Reply via email to