Should this document contain disclaimers about changing slave flags in the face of recovery?
In particular: https://issues.apache.org/jira/browse/MESOS-660

On Mon, Oct 7, 2013 at 11:33 AM, <[email protected]> wrote:

> Updated Branches:
>   refs/heads/master d8da5f4d1 -> 576448554
>
>
> Added slave recovery doc.
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/57644855
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/57644855
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/57644855
>
> Branch: refs/heads/master
> Commit: 57644855419bf5d315b271cb47bd48160eebbe5b
> Parents: d8da5f4
> Author: Vinod Kone <[email protected]>
> Authored: Mon Oct 7 11:33:07 2013 -0700
> Committer: Vinod Kone <[email protected]>
> Committed: Mon Oct 7 11:33:07 2013 -0700
>
> ----------------------------------------------------------------------
>  docs/Slave-Recovery.md | 67 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
> ----------------------------------------------------------------------
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/57644855/docs/Slave-Recovery.md
> ----------------------------------------------------------------------
> diff --git a/docs/Slave-Recovery.md b/docs/Slave-Recovery.md
> new file mode 100644
> index 0000000..637374a
> --- /dev/null
> +++ b/docs/Slave-Recovery.md
> @@ -0,0 +1,67 @@
> +# Slave Recovery #
> +
> +Slave recovery is a feature of Mesos that allows:
> +
> + 1. Executors/tasks to keep running when the slave process is down and
> + 2. A restarted slave process to reconnect with running executors/tasks on the slave.
> +
> +A Mesos slave could be restarted for an upgrade or due to a crash. This feature was introduced in the ***0.14.0*** release.
> +
> +### How does it work? ###
> +
> +Slave recovery works by having the slave checkpoint enough information (e.g., Task Info, Executor Info, Status Updates) about the running tasks and executors to local disk. Once the slave ***and*** the framework(s) enable checkpointing, any subsequent slave restart recovers
> +the checkpointed information and reconnects with the executors. Note that if the host running the slave process is rebooted, all the executors/tasks are killed.
> +
> +> NOTE: To enable slave recovery both the slave and the framework should explicitly request checkpointing.
> +> Alternatively, a framework that doesn't want the disk i/o overhead of checkpointing can opt out of checkpointing.
> +
> +
> +### Enabling slave checkpointing ###
> +
> +As part of this feature, 4 new flags were added to the slave.
> +
> +  - `checkpoint` : Whether to checkpoint slave and frameworks information
> +    to disk [Default: false].
> +    - This enables a restarted slave to recover status updates and reconnect
> +      with (--recover=reconnect) or kill (--recover=cleanup) old executors.
> +
> +  - `strict` : Whether to do recovery in strict mode [Default: true].
> +    - If strict=true, any and all recovery errors are considered fatal.
> +    - If strict=false, any errors (e.g., corruption in checkpointed data) during recovery are
> +      ignored and as much state as possible is recovered.
> +
> +  - `recover` : Whether to recover status updates and reconnect with old executors [Default: reconnect].
> +    - If recover=reconnect, reconnect with any old live executors.
> +    - If recover=cleanup, kill any old live executors and exit.
> +      Use this option when doing an incompatible slave or executor upgrade!
> +      NOTE: If no checkpointing information exists, no recovery is performed
> +      and the slave registers with the master as a new slave.
> +
> +  - `recovery_timeout` : Amount of time allotted for the slave to recover [Default: 15 mins].
> +    - If the slave takes longer than `recovery_timeout` to recover, any executors that are waiting to
> +      reconnect to the slave will self-terminate.
> +      NOTE: This flag is only applicable when `--checkpoint` is enabled.
> +
> +
> +> NOTE: If checkpointing is enabled on the slave, but none of the frameworks have enabled checkpointing,
> +> executors/tasks of frameworks die when the slave dies and are not recovered.
> +
> +A restarted slave should re-register with the master within a timeout (currently, 75s). If the slave takes longer
> +than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks.
> +Therefore, it is highly recommended to automate the process of restarting a slave (e.g., using [monit](http://mmonit.com/monit/)).
> +
> +
> +**For the complete list of slave options: ./mesos-slave.sh --help**
> +
> +
> +### Enabling framework checkpointing ###
> +
> +As part of this feature, `FrameworkInfo` has been updated to include an optional `checkpoint` field. A framework that would like to opt in to checkpointing should set `FrameworkInfo.checkpoint=True` before registering with the master.
> +
> +> NOTE: Frameworks that have enabled checkpointing will only get offers from checkpointing slaves. Therefore, before setting `checkpoint=True` on FrameworkInfo, ensure that there are slaves in your cluster that have enabled checkpointing.
> +> If there are no checkpointing slaves, the framework will not get any offers and hence cannot launch any tasks/executors.
> +
> +
> +### Upgrading to 0.14.0 ###
> +
> +If you want to upgrade a running Mesos cluster to 0.14.0 to take advantage of slave recovery, please follow the [upgrade instructions](https://github.com/apache/mesos/blob/master/docs/Upgrades.md).
> \ No newline at end of file
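
For what it's worth, here is how I read the flag interactions described in the quoted doc, written out as a small Python sketch. This is only a model of the prose, not Mesos code: `SlaveRestart`, `executor_outcome`, and the outcome strings are hypothetical names made up for illustration, and the rules are my interpretation of the doc text.

```python
from dataclasses import dataclass


@dataclass
class SlaveRestart:
    """Hypothetical description of one slave restart (not a Mesos type)."""
    slave_checkpoint: bool                  # slave was started with --checkpoint
    framework_checkpoint: bool              # framework set FrameworkInfo.checkpoint
    recover: str = "reconnect"              # --recover: "reconnect" or "cleanup"
    host_rebooted: bool = False             # the whole host went down, not just the slave
    recovery_secs: float = 0.0              # how long the slave took to recover
    recovery_timeout_secs: float = 15 * 60  # --recovery_timeout (doc default: 15 mins)


def executor_outcome(r: SlaveRestart) -> str:
    """Return what happens to an old executor, per my reading of the doc."""
    if r.host_rebooted:
        # "if the host running the slave process is rebooted all the
        # executors/tasks are killed"
        return "killed"
    if not (r.slave_checkpoint and r.framework_checkpoint):
        # Both the slave and the framework must opt in to checkpointing.
        return "not recovered"
    if r.recover == "cleanup":
        # cleanup kills any old live executors (then the slave exits).
        return "killed"
    if r.recover != "reconnect":
        raise ValueError(f"unknown recover mode: {r.recover!r}")
    if r.recovery_secs > r.recovery_timeout_secs:
        # Executors waiting to reconnect give up after recovery_timeout.
        return "self-terminated"
    return "reconnected"
```

The sketch ignores `--strict` and the 75s master re-registration timeout. It also says nothing about what happens when the flags themselves change between restarts, which is the case I'm asking about.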
