Many setups use something like systemd to ensure that if the slave is shut down or killed, it starts up again and then registers under a new slaveId. This should address your first point, An.
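
To make the restart-on-exit idea concrete, here is a toy Python sketch. It is not how systemd or Monit are implemented, and `supervise`, `max_restarts` and `delay_s` are names made up for illustration; the supervised command is a stand-in that exits immediately, not a real mesos-slave invocation.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay_s=1.0):
    """Run cmd and relaunch it each time it exits, up to max_restarts times.

    Returns the total number of times the command was launched.
    (Hypothetical helper, for illustration only.)
    """
    launches = 0
    while launches <= max_restarts:
        launches += 1
        result = subprocess.run(cmd)  # blocks until the process exits
        print(f"launch {launches} exited with code {result.returncode}",
              file=sys.stderr)
        if launches <= max_restarts:
            time.sleep(delay_s)  # brief back-off before relaunching
    return launches


if __name__ == "__main__":
    # Stand-in for the daemon: a child process that exits right away.
    fake_daemon = [sys.executable, "-c", "raise SystemExit(1)"]
    supervise(fake_daemon, max_restarts=2, delay_s=0.1)
```

In practice you would let the init system or a tool like Monit do this rather than write your own loop; the point is only that the daemon comes back without operator involvement.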

On Tue, Jun 30, 2015 at 8:20 AM, Roger Ignazio <[email protected]> wrote:

> I recently posted a similar question to the user list to better understand
> how slave recovery works. You can read the thread at
> http://mail-archives.apache.org/mod_mbox/mesos-user/201506.mbox/browser
>
> Quoting Vinod from that thread:
>
> > 'recovery_timeout' was added to make sure that if a slave
> > is down for a long time (>10 mins), the executors commit suicide. It is
> > better for the executor/task to die than keep running because the framework
> > might have already launched another replica of that instance. This was not
> > tied to the 75s timeout (hard coded) because it is possible for a slave to
> > successfully re-register with a master after 75s (e.g., both master and
> > slave are down for 5 min).
>
> Adam also replied with a ticket that will allow the 75s ping timeout to be
> configurable in future releases (appears to be 0.23.0 and onward):
> https://issues.apache.org/jira/browse/MESOS-2110
>
> As for shutting down the mesos-slave daemon, I (personally) don't think
> that it's really a problem. There are various tools (Puppet, Monit, etc)
> that allow you to define a service's desired state.
>
> -- Roger
>
> On Tue, Jun 30, 2015 at 3:27 AM, An an Zhao <[email protected]> wrote:
>
> > Hi,
> > For now, master would kill the slave when re-registering timeout
> > according to the document.
> >
> > > If the slave takes longer than this timeout to re-register, the master
> > > shuts down the slave, which in turn shuts down any live executors/tasks.
> >
> > *1.* I think it's more friendly and directly that the slave only kill the
> > executors without exiting, after that the slave start register.
> > On the other hand, It would take some effort to support this, maybe
> > it's not worth.
> > What's your opinion?
> >
> > *2.* The slave has a flag recovery_timeout which is 15min by default.
> > Also the slave will fail to re-register and kill the executors when it
> > takes longer than the health check timeout (which is 75s). So the
> > executors are useless after 75s.
> > *I'm wondering why the recovery_timeout is 15min by default. I think
> > that 75s is enough.* Is this a good idea?
> >
> > Thanks for your time.
> >
> > Best regards.
