Many setups use something like systemd to ensure that if the slave is
shut down or killed, it starts up again and registers with a new
slaveId. This should address your first point, An.
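
On the second point, the interplay of the two timeouts quoted below can be
sketched as a toy model. This only illustrates the reasoning in Vinod's
quote and is not Mesos code: the function name `outcome`, the constants,
and the assumption that the master only counts missed pings while it is
itself up are hypothetical simplifications.

```python
HEALTH_CHECK_TIMEOUT_S = 75    # master-side ping timeout, hard coded per the thread
RECOVERY_TIMEOUT_S = 15 * 60   # slave flag recovery_timeout, default per the thread


def outcome(slave_down_s, master_down_s=0):
    """Toy classification of what happens to executors after a slave outage."""
    # Executors give up once the slave has been gone longer than recovery_timeout.
    if slave_down_s > RECOVERY_TIMEOUT_S:
        return "executors self-terminate"
    # Assumption: the master only notices missed pings while it is up itself,
    # and its downtime overlaps the slave's downtime.
    observed_s = max(0, slave_down_s - master_down_s)
    if observed_s > HEALTH_CHECK_TIMEOUT_S:
        return "master shuts down slave"
    return "slave re-registers"


if __name__ == "__main__":
    # (slave downtime, master downtime) in seconds
    for slave_s, master_s in [(60, 0), (300, 0), (300, 300), (1200, 1200)]:
        print(slave_s, master_s, outcome(slave_s, master_s))
```

The (300, 300) case is the one Vinod mentions: with both master and slave
down for 5 min, the slave can still re-register after more than 75s, so
the executors are still worth keeping at that point.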

On Tue, Jun 30, 2015 at 8:20 AM, Roger Ignazio <[email protected]> wrote:

> I recently posted a similar question to the user list to better understand
> how slave recovery works. You can read the thread at
> http://mail-archives.apache.org/mod_mbox/mesos-user/201506.mbox/browser
>
> Quoting Vinod from that thread:
>
> > 'recovery_timeout' was added to make sure that if a slave is down for
> > a long time (>10 mins), the executors commit suicide. It is better for
> > the executor/task to die than keep running because the framework might
> > have already launched another replica of that instance. This was not
> > tied to the 75s timeout (hard coded) because it is possible for a slave
> > to successfully re-register with a master after 75s (e.g., both master
> > and slave are down for 5 min).
>
> Adam also replied with a ticket that will allow the 75s ping timeout to be
> configurable in future releases (appears to be 0.23.0 and onward):
> https://issues.apache.org/jira/browse/MESOS-2110
>
> As for shutting down the mesos-slave daemon, I (personally) don't think
> that it's really a problem. There are various tools (Puppet, Monit, etc.)
> that allow you to define a service's desired state.
>
> -- Roger
>
> On Tue, Jun 30, 2015 at 3:27 AM, An an Zhao <[email protected]> wrote:
>
> > Hi,
> >     According to the documentation, the master currently kills the
> > slave when re-registration times out.
> >
> > > If the slave takes longer than this timeout to re-register, the master
> > > shuts down the slave, which in turn shuts down any live executors/tasks.
> >
> > *1.* I think it would be friendlier and more direct if the slave only
> > killed the executors without exiting, and then started registering
> > again. On the other hand, supporting this would take some effort, so
> > maybe it's not worth it.
> >      What's your opinion?
> >
> > *2.* The slave has a flag, recovery_timeout, which is 15min by default.
> > The slave will also fail to re-register and kill the executors when it
> > takes longer than the health check timeout (which is 75s), so the
> > executors are useless after 75s.
> >    *I'm wondering why recovery_timeout is 15min by default. I think
> > that 75s is enough.* Is this a good idea?
> >
> >
> >    Thanks for your time.
> >
> > Best regards.
> >
>
