Re: Death of the RM

Paul Read Wed, 01 Apr 2015 05:33:46 -0700

If you don't mind adding a connection to ZooKeeper, you could store the task
status by host and instance in ZooKeeper and clean it up when the RM shuts
down gracefully. Then you should be able to recover at virtually any point,
and you could run multiple RMs if that is a goal.
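
To make the idea concrete, here is a rough sketch of what "status by host and
instance" might look like. Everything in it is an assumption for
illustration: the /rm/tasks path layout, the function names, and the
InMemoryStore class, which only stands in for a real ZooKeeper client so the
example runs on its own. It also assumes that "cleaning up" means deleting
the records on a graceful shutdown.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for a ZooKeeper client: maps path -> bytes."""

    def __init__(self):
        self.nodes = {}

    def put(self, path, data):
        self.nodes[path] = data

    def get(self, path):
        return self.nodes.get(path)

    def delete(self, path):
        self.nodes.pop(path, None)

    def children(self, prefix):
        # All stored paths below the given prefix, in sorted order.
        return sorted(p for p in self.nodes if p.startswith(prefix + "/"))


ROOT = "/rm/tasks"  # hypothetical path layout, not taken from the project


def task_path(host, instance):
    return f"{ROOT}/{host}/{instance}"


def record_status(store, host, instance, status):
    """Write the current status of one task, keyed by host and instance."""
    payload = json.dumps({"status": status}).encode("utf-8")
    store.put(task_path(host, instance), payload)


def graceful_shutdown(store):
    """On a clean RM exit, remove the task records so nothing stale remains."""
    for path in store.children(ROOT):
        store.delete(path)


def recover(store):
    """On RM start, rebuild the task table from whatever is still stored."""
    tasks = {}
    for path in store.children(ROOT):
        host, instance = path[len(ROOT) + 1:].split("/", 1)
        tasks[(host, instance)] = json.loads(store.get(path))["status"]
    return tasks
```

If the RM crashes, the records survive and recover() returns them to the
restarted RM. After a graceful shutdown, recover() returns an empty table.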

Not sure at this point whether the Executor would need to connect to
ZooKeeper or just the Scheduler. At first glance I would think just the
Scheduler. However, if the RM accidentally dies and the Executor is then
killed, it may be reasonable to have the Executor update ZK with its status.
Alternatively, any RM could re-sync when it comes up by requesting a sync
msg and, if it does not get one in a reasonable amount of time, assume the
Executor is dead. It could go so far as to track PIDs and test whether they
are still out there as well.
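
A rough sketch of that re-sync step, under stated assumptions: request_sync
is a hypothetical callback that asks one task for a sync msg and returns
True if a reply arrives within the timeout, and the task table maps a task
id to a PID. The PID test uses os.kill(pid, 0), which is POSIX only and only
means something when the RM runs on the same host as the process.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def resync(known_tasks, request_sync, timeout_s, is_alive=pid_alive):
    """Classify each recorded task after an RM restart.

    known_tasks:  dict of task id -> PID (hypothetical record format)
    request_sync: hypothetical callback (task_id, timeout_s) -> bool
    """
    result = {}
    for task_id, pid in known_tasks.items():
        if request_sync(task_id, timeout_s):
            result[task_id] = "alive"         # answered the sync request
        elif is_alive(pid):
            result[task_id] = "unresponsive"  # no reply, process still there
        else:
            result[task_id] = "dead"          # no reply and no process
    return result
```

The "unresponsive" case is the awkward one: the process exists but did not
answer in time, so the RM would have to decide whether to wait longer or
kill it.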

Just a few thoughts.

On Wed, Apr 1, 2015 at 5:31 AM, Paul Read <[email protected]> wrote:

>
> Is it reasonable to expect the Executor and NM to exit if the
> RM/Scheduler accidentally dies or is killed? Or should a restart of the
> RM/Scheduler re-sync with the running Executor/NM?
>
> I know there is currently no mechanism to do that but I was looking at
> issue #55 and part of the problem/solution would be eliminated if the child
> tasks were to terminate if the RM dies.
>
>
>
