Repository: mesos
Updated Branches:
  refs/heads/master 65030c2e8 -> 20903089b


Documented issue with slave recovery when using systemd.

Documented the problem and solution encountered in MESOS-2419.

Review: https://reviews.apache.org/r/32543


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/20903089
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/20903089
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/20903089

Branch: refs/heads/master
Commit: 20903089b65daf8b032942312d1b7112e2ffdc07
Parents: 65030c2
Author: Joerg Schad <[email protected]>
Authored: Sun Jul 5 20:55:42 2015 -0700
Committer: Benjamin Hindman <[email protected]>
Committed: Sun Jul 5 21:22:30 2015 -0700

----------------------------------------------------------------------
 docs/slave-recovery.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/20903089/docs/slave-recovery.md
----------------------------------------------------------------------
diff --git a/docs/slave-recovery.md b/docs/slave-recovery.md
index 4bb4a71..73b8372 100644
--- a/docs/slave-recovery.md
+++ b/docs/slave-recovery.md
@@ -63,6 +63,21 @@ As part of this feature, `FrameworkInfo` has been updated to include an optional
 > NOTE: Frameworks that have enabled checkpointing will only get offers from 
 > checkpointing slaves. So, before setting `checkpoint=True` on FrameworkInfo, 
 > ensure that there are slaves in your cluster that have enabled checkpointing.
 > Because, if there are no checkpointing slaves, the framework would not get 
 > any offers and hence cannot launch any tasks/executors!
 
+## Known issues with `systemd` and POSIX isolation
+
+There is a known issue when `systemd` launches `mesos-slave` with only `posix` isolation mechanisms: tasks cannot recover after the slave restarts. The default [KillMode](http://www.freedesktop.org/software/systemd/man/systemd.kill.html) for systemd services is `control-group`, so systemd kills every process in the unit's control group, including the executors, when the slave stops. Explicitly setting `KillMode` to `process` makes systemd kill only the main `mesos-slave` process, which lets the executors survive and reconnect.
+
+The following excerpt of a `systemd` unit configuration file shows how to set this option:
+
+```
+[Service]
+ExecStart=/usr/bin/mesos-slave
+KillMode=process
+```
+
+> NOTE: There are also known issues with using `systemd` together with raw `cgroups`-based isolation. For now, the suggested non-POSIX isolation mechanism is Docker containerization.
+
+
 ## Upgrading to 0.14.0
 
 If you want to upgrade a running Mesos cluster to 0.14.0 to take advantage of slave recovery please follow the [upgrade instructions](upgrades.md).
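
The unit-file setting documented in this change can be checked before deployment. The following is a minimal sketch, not part of the commit: the function name `effective_kill_mode` and the constant `SYSTEMD_DEFAULT_KILL_MODE` are hypothetical, and the parsing assumes a single unit file with no drop-in overrides. It reads the unit text with Python's standard `configparser` and reports the `KillMode` that would apply, falling back to systemd's documented default of `control-group` when the option is absent.

```python
import configparser

# systemd's documented default when KillMode= is not set in the unit.
SYSTEMD_DEFAULT_KILL_MODE = "control-group"


def effective_kill_mode(unit_text: str) -> str:
    """Return the KillMode from the [Service] section of a unit file.

    Hypothetical helper for illustration only: it ignores drop-in
    directories and any other override mechanism systemd supports.
    """
    # strict=False tolerates repeated keys (unit files may repeat options);
    # interpolation=None keeps '%' specifiers such as %i from raising errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # preserve key case; systemd keys are case-sensitive
    parser.read_string(unit_text)
    return parser.get(
        "Service", "KillMode", fallback=SYSTEMD_DEFAULT_KILL_MODE
    ).strip()


if __name__ == "__main__":
    unit = "[Service]\nExecStart=/usr/bin/mesos-slave\nKillMode=process\n"
    mode = effective_kill_mode(unit)
    if mode != "process":
        print("warning: executors may be killed when the slave stops:", mode)
    else:
        print("KillMode is 'process'; executors can survive a slave restart")
```

On a running host, `systemctl show -p KillMode <unit>` reports the value systemd actually applies, which is the more authoritative check; the sketch above is only useful for linting a unit file before it is installed.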
