Re: Death of the RM

Santosh Marella Wed, 01 Apr 2015 13:05:25 -0700

> Only one can be registered with Mesos at a time,
so we'd want to ensure that the MyriadScheduler only registers with Mesos
if it is the active RM.


This is already the behavior in Myriad. Only the active-RM loads a YARN
scheduler (FairScheduler/CapcityScheduler etc).
Since Myriad plugs into RM by extending a YARN scheduler, only the active
RM will initialize Myriad. Hence
Myriad registers with mesos only when there is a RM in "active" state.

Thanks,
Santosh

On Wed, Apr 1, 2015 at 12:47 PM, Adam Bordelon <[email protected]> wrote:

> I know one of the things Mohit was working on was scheduler HA and task
> reconciliation, so that if the RM dies, then Marathon (or another PaaS)
> could restart it elsewhere, recover its state, and reconnect to running
> tasks. In this scenario, we definitely want to keep the executors/NMs
> running when the RM/scheduler exits.
> See https://github.com/mesos/myriad/issues/13 and
> https://github.com/mesos/myriad/issues/16
> Mohit, what's the status of this work? Do you have a branch you can share
> that others can continue on, if you don't have time to complete it
> yourself?
>
> RM HA via multiple RMs is a separate consideration:
>
> http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_rm_ha_config.html
> There are a few considerations with having multiple RMs (active+standby)
> hence multiple schedulers. Only one can be registered with Mesos at a time,
> so we'd want to ensure that the MyriadScheduler only registers with Mesos
> if it is the active RM. We'd also want to store/retrieve scheduler state
> (in ZK/wherever) when failing over to another RM/Scheduler, but we'll do
> that for general single-RM HA anyway.
>
> On Wed, Apr 1, 2015 at 5:33 AM, Paul Read <[email protected]> wrote:
>
> > If you don't mind adding a connection to zookeeper, storing the tasking
> > status by host and instance in zookeeper, cleaning up on a graceful RM
> die,
> > then you should be able to recover at virtually any point. And have
> > multiple RMs if that is a goal.
> >
> > Not sure at this point if the Executor would need to connect to zookeeper
> > or just the scheduler. At first glance I would think just the Scheduler
> > however if the RM accidentally dies and then the Executor is killed it
> may
> > be reasonable to have it update ZK with status...or just have any RM when
> > it comes up to re-sync by requesting a sync msg and if it does not get
> one
> > in a reasonable amount of time assume its dead...could go so far as to
> > track PIDs and test to see if they are out there as well.
> >
> > Just a few thoughts.
> >
> > On Wed, Apr 1, 2015 at 5:31 AM, Paul Read <[email protected]> wrote:
> >
> > >
> > > Is it reasonable to expect the Executor and NM to exit if the the
> > > RM/Scheduler accidently dies or is killed? Or should a restart of the
> > > RM/Scheduler re-sync with the running Executor/NM ?
> > >
> > > I know there is currently no mechanism to do that but I was looking at
> > > issue #55 and part of the problem/solution would be eliminated if the
> > child
> > > tasks were to terminate if the RM dies.
> > >
> > >
> > >
> >
>

Re: Death of the RM

Reply via email to