If you don't mind adding a connection to zookeeper, storing the tasking status by host and instance in zookeeper, cleaning up on a graceful RM die, then you should be able to recover at virtually any point. And have multiple RMs if that is a goal.
Not sure at this point if the Executor would need to connect to zookeeper or just the scheduler. At first glance I would think just the Scheduler however if the RM accidentally dies and then the Executor is killed it may be reasonable to have it update ZK with status...or just have any RM when it comes up to re-sync by requesting a sync msg and if it does not get one in a reasonable amount of time assume its dead...could go so far as to track PIDs and test to see if they are out there as well. Just a few thoughts. On Wed, Apr 1, 2015 at 5:31 AM, Paul Read <[email protected]> wrote: > > Is it reasonable to expect the Executor and NM to exit if the the > RM/Scheduler accidently dies or is killed? Or should a restart of the > RM/Scheduler re-sync with the running Executor/NM ? > > I know there is currently no mechanism to do that but I was looking at > issue #55 and part of the problem/solution would be eliminated if the child > tasks were to terminate if the RM dies. > > >
