Updated docs to mention new flag. Review: https://reviews.apache.org/r/64450/
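For readers skimming this change, the `equal` and `additive` policies documented in the diff below can be illustrated with a small sketch. This is only an illustration under simplifying assumptions, not the actual check in Mesos (which the docs link to in `src/slave/compatibility.hpp`): `AgentConfig` and `is_compatible` are hypothetical names, resources are modelled as plain scalars, and the fault domain as a single optional string.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AgentConfig:
    """Hypothetical, simplified stand-in for an agent's checkpointed info."""
    resources: Dict[str, float] = field(default_factory=dict)
    attributes: Dict[str, str] = field(default_factory=dict)
    domain: Optional[str] = None


def is_compatible(old: AgentConfig, new: AgentConfig, policy: str = "equal") -> bool:
    """Return True if `new` may recover the state checkpointed under `old`."""
    if policy == "equal":
        # No configuration changes are accepted.
        return old == new
    if policy != "additive":
        raise ValueError(f"unknown policy: {policy!r}")
    # Assumption: every old resource must still be present, with at least
    # the old amount; new resources may be added.
    for name, amount in old.resources.items():
        if new.resources.get(name, 0.0) < amount:
            return False
    # Attributes may be added, but existing ones may not be removed or modified.
    for key, value in old.attributes.items():
        if new.attributes.get(key) != value:
            return False
    # A domain may be added, but an existing one must stay unchanged.
    if old.domain is not None and new.domain != old.domain:
        return False
    return True
```

Under this sketch, adding a resource, an attribute and a domain passes the `additive` check but fails the `equal` one, while removing any of them fails both.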
Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/ee621bb3
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/ee621bb3
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/ee621bb3

Branch: refs/heads/master
Commit: ee621bb36debf9d88325ef339afa46ade384e23d
Parents: 196fe20
Author: Benno Evers <bev...@mesosphere.com>
Authored: Wed Dec 20 15:00:08 2017 +0100
Committer: Alexander Rukletsov <al...@apache.org>
Committed: Wed Dec 20 15:01:51 2017 +0100

----------------------------------------------------------------------
 CHANGELOG                   |   4 ++
 docs/agent-recovery.md      | 129 +++++++++++++++++++++++++--------------
 docs/configuration/agent.md |  18 ++++++
 src/slave/flags.cpp         |  12 ++--
 4 files changed, 111 insertions(+), 52 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/mesos/blob/ee621bb3/CHANGELOG
----------------------------------------------------------------------
diff --git a/CHANGELOG b/CHANGELOG
index 63ac84c..f1c4195 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,6 +1,10 @@
 Release Notes - Mesos - Version 1.5.0 (WIP)
 -------------------------------------------
 This release contains the following new features:
+  * [MESOS-1739] - Agents now support the `--reconfiguration_policy`
+    flag which allows them to recover the agent ID and running tasks
+    after configuration changes. See docs/agent-recovery.md for more
+    details.
 Deprecations/Removals:
   * Agent flag `--executor_secret_key` has been deprecated. Operators

http://git-wip-us.apache.org/repos/asf/mesos/blob/ee621bb3/docs/agent-recovery.md
----------------------------------------------------------------------
diff --git a/docs/agent-recovery.md b/docs/agent-recovery.md
index 35cd5b1..8df72cb 100644
--- a/docs/agent-recovery.md
+++ b/docs/agent-recovery.md
@@ -8,71 +8,108 @@ layout: documentation
 If the `mesos-agent` process on a host exits (perhaps due to a Mesos bug or
 because the operator kills the process while [upgrading Mesos](upgrades.md)),
 any executors/tasks that were being managed by the `mesos-agent` process will
-continue to run. When `mesos-agent` is restarted, the operator can control how
-those old executors/tasks are handled:
+continue to run.
-  1. By default, all the executors/tasks that were being managed by the old
-     `mesos-agent` process are killed.
-  2. If a framework enabled _checkpointing_ when it registered with the master,
-     any executors belonging to that framework can reconnect to the new
-     `mesos-agent` process and continue running uninterrupted.
+By default, all the executors/tasks that were being managed by the old
+`mesos-agent` process are expected to gracefully exit on their own, and
+will be shut down after the agent restarted if they did not.
-Hence, enabling framework checkpointing enables tasks to tolerate Mesos agent
-upgrades and unexpected `mesos-agent` crashes without experiencing any
-downtime.
+However, if a framework enabled _checkpointing_ when it registered with the
+master, any executors belonging to that framework can reconnect to the new
+`mesos-agent` process and continue running uninterrupted. Hence, enabling
+framework checkpointing allows tasks to tolerate Mesos agent upgrades and
+unexpected `mesos-agent` crashes without experiencing any downtime.
-Agent recovery works by having the agent _checkpoint_ information (e.g., Task
-Info, Executor Info, Status Updates) about the tasks and executors it is
-managing to local disk. If a framework enables checkpointing, any subsequent
-agent restarts will recover the checkpointed information and reconnect with any
-executors that are still running.
+Agent recovery works by having the agent checkpoint information about its own
+state and about the tasks and executors it is managing to local disk, for
+example the `SlaveInfo`, `FrameworkInfo` and `ExecutorInfo` messages or the
+unacknowledged status updates of running tasks.
-Note that if the operating system on the agent is rebooted, all executors and
-tasks running on the host are killed and are not automatically restarted when
-the host comes back up.
+When the agent restarts, it will verify that its current configuration, set
+from the environment variables and command-line flags, is compatible with the
+checkpointed information and will refuse to restart if not.
-However the agent is allowed to recover its agent ID post a host reboot.
-In case the agent's recovery runs into agent info mismatch which may happen due to resource change associated with reboot, it'll fall back to recovering as a new agent (existing behavior).
-In other cases such as checkpointed resources (e.g. persistent volumes) being incompatible with the agent's resources the recovery will still fail (existing behavior).
+A special case occurs when the agent detects that its host system was rebooted
+since the last run of the agent: The agent will try to recover its previous ID
+as usual, but if that fails it will actually erase the information of the
+previous run and will register with the master as a new agent.
+
+Note that executors and tasks that exited between agent shutdown and restart
+are not automatically restarted during agent recovery.
 ## Framework Configuration
-A framework can control whether its executors will be recovered by setting the `checkpoint` flag in its `FrameworkInfo` when registering with the master. Enabling this feature results in increased I/O overhead at each agent that runs tasks launched by the framework. By default, frameworks do **not** checkpoint their state.
+A framework can control whether its executors will be recovered by setting
+the `checkpoint` flag in its `FrameworkInfo` when registering with the master.
+Enabling this feature results in increased I/O overhead at each agent that runs
+tasks launched by the framework. By default, frameworks do **not** checkpoint
+their state.
 ## Agent Configuration
-Three [configuration flags](configuration/agent.md) control the recovery behavior of a Mesos agent:
+Four [configuration flags](configuration/agent.md) control the recovery
+behavior of a Mesos agent:
 * `strict`: Whether to do agent recovery in strict mode [Default: true].
     - If strict=true, all recovery errors are considered fatal.
-    - If strict=false, any errors (e.g., corruption in checkpointed data) during recovery are
-      ignored and as much state as possible is recovered.
-
-* `recover`: Whether to recover status updates and reconnect with old executors [Default: reconnect].
-    - If recover=reconnect, reconnect with any old live executors, provided the executor's framework enabled checkpointing.
-    - If recover=cleanup, kill any old live executors and exit. Use this option when doing an incompatible agent or executor upgrade!
-    > NOTE: If no checkpointing information exists, no recovery is performed
-    > and the agent registers with the master as a new agent.
-
-* `recovery_timeout`: Amount of time allotted for the agent to recover [Default: 15 mins].
-    - If the agent takes longer than `recovery_timeout` to recover, any executors that are waiting to
-      reconnect to the agent will self-terminate.
-
-> NOTE: If none of the frameworks have enabled checkpointing,
-> the executors and tasks running at an agent die when the agent dies
-> and are not recovered.
-
-A restarted agent should re-register with master within a timeout (75 seconds by default: see the `--max_agent_ping_timeouts` and `--agent_ping_timeout` [configuration flags](configuration.md)). If the agent takes longer than this timeout to re-register, the master shuts down the agent, which in turn will shutdown any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting an agent (e.g., using a process supervisor such as [monit](http://mmonit.com/monit/) or `systemd`).
+    - If strict=false, any errors (e.g., corruption in checkpointed data) during
+      recovery are ignored and as much state as possible is recovered.
+
+* `reconfiguration_policy`: Which kind of configuration changes are accepted
+  when trying to recover [Default: equal].
+    - If reconfiguration_policy=equal, no configuration changes are accepted.
+    - If reconfiguration_policy=additive, the agent will allow the new
+      configuration to contain additional attributes, increased resourced or an
+      additional fault domain. For a more detailed description, see
+      [this](https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/slave/compatibility.hpp;h=78b421a01abe5d2178c93832577577a7ba282b38;hb=HEAD#l37).
+
+* `recover`: Whether to recover status updates and reconnect with old
+  executors [Default: reconnect]
+    - If recover=reconnect, reconnect with any old live executors, provided
+      the executor's framework enabled checkpointing.
+    - If recover=cleanup, kill any old live executors and exit. Use this
+      option when doing an incompatible agent or executor upgrade!
+    **NOTE:** If no checkpointing information exists, no recovery is performed
+    and the agent registers with the master as a new agent.
+
+* `recovery_timeout`: Amount of time allotted for the agent to
+  recover [Default: 15 mins].
+    - If the agent takes longer than `recovery_timeout` to recover, any
+      executors that are waiting to reconnect to the agent will self-terminate.
+    **NOTE:** If none of the frameworks have enabled checkpointing, the
+    executors and tasks running at an agent die when the agent dies and are
+    not recovered.
+
+A restarted agent should re-register with master within a timeout (75 seconds
+by default: see the `--max_agent_ping_timeouts` and `--agent_ping_timeout`
+[configuration flags](configuration.md)). If the agent takes longer than this
+timeout to re-register, the master shuts down the agent, which in turn will
+shutdown any live executors/tasks.
+
+Therefore, it is highly recommended to automate the process of restarting an
+agent, e.g. using a process supervisor such as [monit](http://mmonit.com/monit/)
+or `systemd`.
 ## Known issues with `systemd` and process lifetime
-There is a known issue when using `systemd` to launch the `mesos-agent`. A description of the problem can be found in [MESOS-3425](https://issues.apache.org/jira/browse/MESOS-3425) and all relevant work can be tracked in the epic [MESOS-3007](https://issues.apache.org/jira/browse/MESOS-3007).
-This problem was fixed in Mesos `0.25.0` for the mesos containerizer when cgroups isolation is enabled. Further fixes for the posix isolators and docker containerizer are available in `0.25.1`, `0.26.1`, `0.27.1`, and `0.28.0`.
+There is a known issue when using `systemd` to launch the `mesos-agent`. A
+description of the problem can be found in [MESOS-3425](https://issues.apache.org/jira/browse/MESOS-3425)
+and all relevant work can be tracked in the epic [MESOS-3007](https://issues.apache.org/jira/browse/MESOS-3007).
+
+This problem was fixed in Mesos `0.25.0` for the mesos containerizer when
+cgroups isolation is enabled. Further fixes for the posix isolators and docker
+containerizer are available in `0.25.1`, `0.26.1`, `0.27.1`, and `0.28.0`.
-It is recommended that you use the default [KillMode](http://www.freedesktop.org/software/systemd/man/systemd.kill.html) for systemd processes, which is `control-group`, which kills all child processes when the agent stops. This ensures that "side-car" processes such as the `fetcher` and `perf` are terminated alongside the agent.
-The systemd patches for Mesos explicitly move executors and their children into a separate systemd slice, dissociating their lifetime from the agent. This ensures the executors survive agent restarts.
+It is recommended that you use the default [KillMode](http://www.freedesktop.org/software/systemd/man/systemd.kill.html)
+for systemd processes, which is `control-group`, which kills all child processes
+when the agent stops. This ensures that "side-car" processes such as the
+`fetcher` and `perf` are terminated alongside the agent.
+The systemd patches for Mesos explicitly move executors and their children into
+a separate systemd slice, dissociating their lifetime from the agent. This
+ensures the executors survive agent restarts.
-The following excerpt of a `systemd` unit configuration file shows how to set the flag explicitly:
+The following excerpt of a `systemd` unit configuration file shows how to set
+the flag explicitly:
 ```
 [Service]

http://git-wip-us.apache.org/repos/asf/mesos/blob/ee621bb3/docs/configuration/agent.md
----------------------------------------------------------------------
diff --git a/docs/configuration/agent.md b/docs/configuration/agent.md
index d629912..b7a242c 100644
--- a/docs/configuration/agent.md
+++ b/docs/configuration/agent.md
@@ -1047,6 +1047,24 @@ this flag. (default: 0secs)
 </tr>
 <tr>
   <td>
+    --reconfiguration_policy=VALUE
+  </td>
+  <td>
+This flag controls which agent configuration changes are considered
+acceptable when recovering the previous agent state. Possible values:
+ equal: The old and the new state must match exactly.
+ additive: The new state must be a superset of the old state:
+ it is permitted to add additional resources, attributes
+ and domains but not to remove or to modify existing ones.
+
+Note that this only affects the checking done on the agent itself,
+the master may still reject the agent if it detects a change that it
+considers unacceptable, which, e.g., currently happens when port or hostname
+are changed. (default: equal)
+  </td>
+</tr>
+<tr>
+  <td>
     --recover=VALUE
   </td>
   <td>

http://git-wip-us.apache.org/repos/asf/mesos/blob/ee621bb3/src/slave/flags.cpp
----------------------------------------------------------------------
diff --git a/src/slave/flags.cpp b/src/slave/flags.cpp
index c789e7d..48b8821 100644
--- a/src/slave/flags.cpp
+++ b/src/slave/flags.cpp
@@ -496,14 +496,14 @@ mesos::internal::slave::Flags::Flags()
       "reconfiguration_policy",
       "This flag controls which agent configuration changes are considered\n"
       "acceptable when recovering the previous agent state. Possible values:\n"
-      "equal: Require that the old and the new state match exactly.\n"
-      "additive: Require that the new state is a superset of the old state:\n"
+      "equal: The old and the new state must match exactly.\n"
+      "additive: The new state must be a superset of the old state:\n"
       " it is permitted to add additional resources, attributes\n"
-      " and domains but not to remove existing ones.\n"
+      " and domains but not to remove or to modify existing ones.\n"
       "Note that this only affects the checking done on the agent itself,\n"
-      "the master may still reject the slave if it detects a change that it\n"
-      "considers unacceptable, which currently happens when port or hostname\n"
-      "are changed.",
+      "the master may still reject the agent if it detects a change that it\n"
+      "considers unacceptable, which, e.g., currently happens when port or\n"
+      "hostname are changed.",
       "equal");
   add(&Flags::strict,