Repository: aurora

Updated Branches:
  refs/heads/master a9e7a35a2 -> 2d59b697a
Move lifecycle documentation into separate file

In addition to the move, a couple of related additions and adjustments
have been made:

* slight reorganization
* documentation of missing states (THROTTLED, DRAINING)
* dedicated section on reconciliation
* remark regarding the uniqueness of an instance
* updated documentation of the teardown of a task
  (HTTPLifecycleConfig and finalization_wait)

Bugs closed: AURORA-1068, AURORA-1262, AURORA-734

Reviewed at https://reviews.apache.org/r/43013/

Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/2d59b697
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/2d59b697
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/2d59b697

Branch: refs/heads/master
Commit: 2d59b697a745f9540d53da1659cda4683c929b34
Parents: a9e7a35
Author: Stephan Erb <[email protected]>
Authored: Sat Feb 6 23:22:21 2016 +0100
Committer: Stephan Erb <[email protected]>
Committed: Sat Feb 6 23:22:21 2016 +0100

----------------------------------------------------------------------
 docs/README.md                  |   1 +
 docs/configuration-reference.md |  19 ++---
 docs/task-lifecycle.md          | 146 +++++++++++++++++++++++++++++++++++
 docs/user-guide.md              | 125 ++----------------------------
 4 files changed, 164 insertions(+), 127 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/2d59b697/docs/README.md
----------------------------------------------------------------------
diff --git a/docs/README.md b/docs/README.md
index 8ebc061..78f062a 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -11,6 +11,7 @@ We encourage you to ask questions on the [Aurora user list](http://aurora.apache
 * [Install Aurora on virtual machines on your private machine](vagrant.md)
 * [Hello World Tutorial](tutorial.md)
 * [User Guide](user-guide.md)
+ * [Task Lifecycle](task-lifecycle.md)
 * [Configuration Tutorial](configuration-tutorial.md)
 * [Aurora + Thermos Reference](configuration-reference.md)
 * [Command Line Client](client-commands.md)


http://git-wip-us.apache.org/repos/asf/aurora/blob/2d59b697/docs/configuration-reference.md
----------------------------------------------------------------------
diff --git a/docs/configuration-reference.md b/docs/configuration-reference.md
index 995f706..3f023d7 100644
--- a/docs/configuration-reference.md
+++ b/docs/configuration-reference.md
@@ -312,10 +312,10 @@ upon one final Process ("reducer") to tabulate the results:

 #### finalization_wait

-Tasks have three active stages: `ACTIVE`, `CLEANING`, and `FINALIZING`. The
-`ACTIVE` stage is when ordinary processes run. This stage lasts as
-long as Processes are running and the Task is healthy. The moment either
-all Processes have finished successfully or the Task has reached a
+Process execution is organized into three active stages: `ACTIVE`,
+`CLEANING`, and `FINALIZING`. The `ACTIVE` stage is when ordinary processes run.
+This stage lasts as long as Processes are running and the Task is healthy.
+The moment either all Processes have finished successfully or the Task has reached a
 maximum Process failure limit, it goes into `CLEANING` stage and sends SIGTERMs
 to all currently running Processes and their process trees. Once all
 Processes have terminated, the Task goes into `FINALIZING` stage
@@ -327,10 +327,7 @@ finish during that time, all remaining Processes are sent SIGKILLs
 (or if they depend upon uncompleted Processes, are never invoked.)
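To make the stages concrete, here is a minimal sketch in the Aurora/Thermos configuration DSL (an assumed `.aurora` snippet; the process names, command lines, and the 45-second value are illustrative and not taken from the change above). A Process marked `final = True` only runs in the `FINALIZING` stage, and `finalization_wait` bounds how long cleanup may take before remaining Processes are sent SIGKILLs:

```python
# Illustrative sketch only; names and values are assumptions.
main = Process(name = 'main', cmdline = './run_service.sh')

log_uploader = Process(
  name = 'log_uploader',
  cmdline = './upload_logs.sh',
  final = True)  # started only once all ordinary Processes are done

task = Task(
  name = 'service_with_cleanup',
  processes = [main, log_uploader],
  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 256*MB),
  finalization_wait = 45)  # seconds granted before SIGKILL (capped at 60 on Aurora)
```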
-Client applications with higher priority may force a shorter
-finalization wait (e.g. through parameters to `thermos kill`), so this
-is mostly a best-effort signal.
-
+When running on Aurora, the `finalization_wait` is capped at 60 seconds.

 ### Constraint Object

@@ -515,7 +512,7 @@ Describes the container the job's processes will run inside.

 ### Docker Parameter Object

 Docker CLI parameters. This needs to be enabled by the scheduler `enable_docker_parameters` option.
-See [Docker Command Line Reference](https://docs.docker.com/reference/commandline/run/) for valid parameters.
+See [Docker Command Line Reference](https://docs.docker.com/reference/commandline/run/) for valid parameters.

 param | type | description
 ----- | :----: | -----------
@@ -611,6 +608,10 @@ to distinguish between Task replicas.

 | ```instance``` | Integer | The instance number of the created task. A job with 5 replicas has instance numbers 0, 1, 2, 3, and 4.
 | ```hostname``` | String | The instance hostname that the task was launched on.

+Please note that there is no uniqueness guarantee for `instance` in the presence of
+network partitions. If that is required, it should be baked in at the application
+level using a distributed coordination service such as Zookeeper.
+
 ### thermos Namespace

 The `thermos` namespace contains variables that work directly on the


http://git-wip-us.apache.org/repos/asf/aurora/blob/2d59b697/docs/task-lifecycle.md
----------------------------------------------------------------------
diff --git a/docs/task-lifecycle.md b/docs/task-lifecycle.md
new file mode 100644
index 0000000..e85e754
--- /dev/null
+++ b/docs/task-lifecycle.md
@@ -0,0 +1,146 @@
+# Task Lifecycle
+
+When Aurora reads a configuration file and finds a `Job` definition, it:
+
+1. Evaluates the `Job` definition.
+2. Splits the `Job` into its constituent `Task`s.
+3. Sends those `Task`s to the scheduler.
+4. The scheduler puts the `Task`s into `PENDING` state, starting each
+   `Task`'s life cycle.
+
+
+
+Please note that a couple of the task states described below are missing from
+this state diagram.
+
+
+## PENDING to RUNNING states
+
+When a `Task` is in the `PENDING` state, the scheduler constantly
+searches for machines satisfying that `Task`'s resource request
+requirements (RAM, disk space, CPU time) while maintaining configuration
+constraints such as "a `Task` must run on machines dedicated to a
+particular role" or attribute limit constraints such as "at most 2
+`Task`s from the same `Job` may run on each rack". When the scheduler
+finds a suitable match, it assigns the `Task` to a machine and puts the
+`Task` into the `ASSIGNED` state.
+
+From the `ASSIGNED` state, the scheduler sends an RPC to the slave
+machine containing `Task` configuration, which the slave uses to spawn
+an executor responsible for the `Task`'s lifecycle. When the scheduler
+receives an acknowledgment that the machine has accepted the `Task`,
+the `Task` goes into `STARTING` state.
+
+`STARTING` state initializes a `Task` sandbox. When the sandbox is fully
+initialized, Thermos begins to invoke `Process`es. Also, the slave
+machine sends an update to the scheduler that the `Task` is
+in `RUNNING` state.
+
+
+## RUNNING to terminal states
+
+There are various ways that an active `Task` can transition into a terminal
+state. By definition, it can never leave this state. However, depending on
+the nature of the termination and the originating `Job` definition
+(e.g. `service`, `max_task_failures`), a replacement `Task` might be
+scheduled.
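For illustration, here is a hedged sketch of the two `Job` knobs mentioned above, as they might appear in a hypothetical `.aurora` file (the cluster, role, and command names are placeholders): `max_task_failures` bounds how many failures are tolerated before the `Task` is left in `FAILED` for good, while `service = True` requests a replacement regardless of how the `Task` exited.

```python
# Hypothetical ad-hoc job. After max_task_failures failures the Task
# stays FAILED; setting service = True would instead reschedule it
# indefinitely, whatever its exit status.
hello = Process(name = 'hello', cmdline = 'echo hello world')

hello_task = Task(
  name = 'hello',
  processes = [hello],
  resources = Resources(cpu = 0.1, ram = 16*MB, disk = 16*MB))

jobs = [Job(
  cluster = 'devcluster',   # placeholder cluster/role/environment
  role = 'www-data',
  environment = 'devel',
  name = 'hello_world',
  task = hello_task,
  instances = 1,
  service = False,
  max_task_failures = 3)]
```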
+
+### Natural Termination: FINISHED, FAILED
+
+A `RUNNING` `Task` can terminate without direct user interaction. For
+example, it may be a finite computation that finishes, even something as
+simple as `echo hello world`, or it could be an exceptional condition in
+a long-lived service. If the `Task` is successful (its underlying
+processes have succeeded with exit status `0` or finished without
+reaching failure limits) it moves into `FINISHED` state. If it finished
+after reaching a set of failure limits, it goes into `FAILED` state.
+
+A terminated `Task` that is subject to rescheduling will be temporarily
+`THROTTLED` if it is considered to be flapping. A task is considered flapping
+if its previous invocation terminated after running for less than 5 minutes
+(the scheduler default). The time a task must remain in the `THROTTLED` state
+before it is eligible for rescheduling increases with each consecutive
+failure.
+
+### Forceful Termination: KILLING, RESTARTING
+
+You can terminate a `Task` by issuing an `aurora job kill` command, which
+moves it into `KILLING` state. The scheduler then sends the slave a
+request to terminate the `Task`. If the scheduler receives a successful
+response, it moves the `Task` into `KILLED` state and never restarts it.
+
+If a `Task` is forced into the `RESTARTING` state via the `aurora job restart`
+command, the scheduler kills the underlying task but in parallel schedules
+an identical replacement for it.
+
+In any case, the responsible executor on the slave follows an escalation
+sequence when killing a running task:
+
+ 1. If an `HTTPLifecycleConfig` is not present, skip to (4).
+ 2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds.
+ 3. Send a POST to the `shutdown_endpoint` and wait 5 seconds.
+ 4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds.
+ 5. Send SIGKILL (`kill -9`).
+
+If the executor notices that all `Process`es in a `Task` have aborted
+during this sequence, it will not proceed with subsequent steps.
+Note that graceful shutdown is best-effort, and due to the many
+inevitable realities of distributed systems, it may not be performed.
+
+### Unexpected Termination: LOST
+
+If a `Task` stays in a transient task state for too long (such as `ASSIGNED`
+or `STARTING`), the scheduler forces it into `LOST` state, creating a new
+`Task` in its place that's sent into `PENDING` state.
+
+In addition, if the Mesos core tells the scheduler that a slave has
+become unhealthy (or outright disappeared), the `Task`s assigned to that
+slave go into `LOST` state and new `Task`s are created in their place.
+From `PENDING` state, there is no guarantee a `Task` will be reassigned
+to the same machine unless job constraints explicitly force it there.
+
+### Giving Priority to Production Tasks: PREEMPTING
+
+Sometimes a Task needs to be interrupted, such as when a non-production
+Task's resources are needed by a higher-priority production Task. This
+type of interruption is called a *pre-emption*. When this happens in
+Aurora, the non-production Task is killed and moved into
+the `PREEMPTING` state when both of the following are true:
+
+- The task being killed is a non-production task.
+- The other task is a `PENDING` production task that hasn't been
+  scheduled due to a lack of resources.
+
+The scheduler UI shows that the non-production task was preempted in favor of
+the production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
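As a sketch of how this distinction is expressed in configuration (a hypothetical `.aurora` snippet; the names are placeholders and quota handling is left out here), jobs opt into production treatment via the `production` flag:

```python
# Shared placeholder task used by both jobs below.
work = Process(name = 'work', cmdline = 'sleep 60')
work_task = Task(
  name = 'work',
  processes = [work],
  resources = Resources(cpu = 1.0, ram = 256*MB, disk = 256*MB))

# Non-production job: a candidate for preemption.
nightly_batch = Job(
  cluster = 'devcluster', role = 'www-data', environment = 'devel',
  name = 'nightly_batch', task = work_task, instances = 4,
  production = False)

# Production job: while stuck in PENDING for lack of resources, it may
# preempt non-production tasks such as the ones above.
frontend = Job(
  cluster = 'devcluster', role = 'www-data', environment = 'prod',
  name = 'frontend', task = work_task, instances = 4,
  production = True)

jobs = [nightly_batch, frontend]
```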
+
+Note that non-production tasks consuming many resources are likely to be
+preempted in favor of production tasks.
+
+### Making Room for Maintenance: DRAINING
+
+Cluster operators can put a slave into maintenance mode. This will transition
+all `Task`s running on this slave into `DRAINING` and eventually to `KILLED`.
+Drained `Task`s will be restarted on other slaves for which no maintenance
+has been announced yet.
+
+
+## State Reconciliation
+
+Due to the many inevitable realities of distributed systems, there might
+be a mismatch between perceived and actual cluster state (e.g. a machine returns
+from a `netsplit` but the scheduler has already marked all its `Task`s as
+`LOST` and rescheduled them).
+
+Aurora regularly runs a state reconciliation process in order to detect
+and correct such issues (e.g. by killing the errant `RUNNING` tasks).
+By default, the proper detection of all failure scenarios and inconsistencies
+may take up to an hour.
+
+To emphasize this point: there is no uniqueness guarantee for a single
+instance of a job in the presence of network partitions. If the `Task`
+requires that, it should be baked in at the application level using a
+distributed coordination service such as Zookeeper.


http://git-wip-us.apache.org/repos/asf/aurora/blob/2d59b697/docs/user-guide.md
----------------------------------------------------------------------
diff --git a/docs/user-guide.md b/docs/user-guide.md
index df63468..656296c 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -3,14 +3,8 @@ Aurora User Guide

 - [Overview](#user-content-overview)
 - [Job Lifecycle](#user-content-job-lifecycle)
-  - [Life Of A Task](#user-content-life-of-a-task)
-  - [PENDING to RUNNING states](#user-content-pending-to-running-states)
   - [Task Updates](#user-content-task-updates)
-  - [HTTP Health Checking and Graceful Shutdown](#user-content-http-health-checking-and-graceful-shutdown)
-  - [Tearing a task down](#user-content-tearing-a-task-down)
-  - [Giving Priority to Production Tasks: PREEMPTING](#user-content-giving-priority-to-production-tasks-preempting)
-  - [Natural Termination: FINISHED, FAILED](#user-content-natural-termination-finished-failed)
-  - [Forceful Termination: KILLING, RESTARTING](#user-content-forceful-termination-killing-restarting)
+  - [HTTP Health Checking](#user-content-http-health-checking)
 - [Service Discovery](#user-content-service-discovery)
 - [Configuration](#user-content-configuration)
 - [Creating Jobs](#user-content-creating-jobs)
@@ -99,60 +93,13 @@ will be around forever, e.g. by building log saving or other checkpointing
 mechanisms directly into your application or into your `Job` description.

+
 Job Lifecycle
 -------------

-When Aurora reads a configuration file and finds a `Job` definition, it:
-
-1. Evaluates the `Job` definition.
-2. Splits the `Job` into its constituent `Task`s.
-3. Sends those `Task`s to the scheduler.
-4. The scheduler puts the `Task`s into `PENDING` state, starting each
-   `Task`'s life cycle.
-
-### Life Of A Task
-
-
-### PENDING to RUNNING states
-
-When a `Task` is in the `PENDING` state, the scheduler constantly
-searches for machines satisfying that `Task`'s resource request
-requirements (RAM, disk space, CPU time) while maintaining configuration
-constraints such as "a `Task` must run on machines dedicated to a
-particular role" or attribute limit constraints such as "at most 2
-`Task`s from the same `Job` may run on each rack". When the scheduler
-finds a suitable match, it assigns the `Task` to a machine and puts the
-`Task` into the `ASSIGNED` state.
-
-From the `ASSIGNED` state, the scheduler sends an RPC to the slave
-machine containing `Task` configuration, which the slave uses to spawn
-an executor responsible for the `Task`'s lifecycle. When the scheduler
-receives an acknowledgement that the machine has accepted the `Task`,
-the `Task` goes into `STARTING` state.
-
-`STARTING` state initializes a `Task` sandbox. When the sandbox is fully
-initialized, Thermos begins to invoke `Process`es. Also, the slave
-machine sends an update to the scheduler that the `Task` is
-in `RUNNING` state.
-
-If a `Task` stays in `ASSIGNED` or `STARTING` for too long, the
-scheduler forces it into `LOST` state, creating a new `Task` in its
-place that's sent into `PENDING` state. This is technically true of any
-active state: if the Mesos core tells the scheduler that a slave has
-become unhealthy (or outright disappeared), the `Task`s assigned to that
-slave go into `LOST` state and new `Task`s are created in their place.
-From `PENDING` state, there is no guarantee a `Task` will be reassigned
-to the same machine unless job constraints explicitly force it there.
-
-If there is a state mismatch, (e.g. a machine returns from a `netsplit`
-and the scheduler has marked all its `Task`s `LOST` and rescheduled
-them), a state reconciliation process kills the errant `RUNNING` tasks,
-which may take up to an hour. But to emphasize this point: there is no
-uniqueness guarantee for a single instance of a job in the presence of
-network partitions. If the Task requires that, it should be baked in at
-the application level using a distributed coordination service such as
-Zookeeper.
+`Job`s and their `Task`s have various states that are described in the [Task Lifecycle](task-lifecycle.md).
+However, in day-to-day use, you'll be primarily concerned with launching new jobs and updating existing ones.
+
 ### Task Updates

@@ -186,14 +133,14 @@ with old instance configs and batch updates proceed backwards
 from the point where the update failed. E.g., (0,1,2) (3,4,5) (6,7, 8-FAIL)
 results in a rollback in order (8,7,6) (5,4,3) (2,1,0).

-### HTTP Health Checking and Graceful Shutdown
+### HTTP Health Checking

 The Executor implements a protocol for rudimentary control of a task via HTTP.
 Tasks subscribe for this protocol by declaring a port named `health`.
 Take for example this configuration snippet:

     nginx = Process(
       name = 'nginx',
-      cmdline = './run_nginx.sh -port {{thermos.ports[http]}}')
+      cmdline = './run_nginx.sh -port {{thermos.ports[health]}}')

 When this Process is included in a job, the job will be allocated a port,
 and the command line will be replaced with something like:
@@ -208,8 +155,6 @@ requests:

 | HTTP request | Description |
 | ------------ | ----------- |
 | `GET /health` | Inquires whether the task is healthy. |
-| `POST /quitquitquit` | Task should initiate graceful shutdown. |
-| `POST /abortabortabort` | Final warning task is being killed. |

 Please see the
 [configuration reference](configuration-reference.md#user-content-healthcheckconfig-objects) for
@@ -227,62 +172,6 @@ process.

 WARNING: Remember to remove this when you are done, otherwise your
 instance will have permanently disabled health checks.

-#### Tearing a task down
-
-The Executor follows an escalation sequence when killing a running task:
-
- 1. If `health` port is not present, skip to (5)
- 2. POST /quitquitquit
- 3. wait 5 seconds
- 4. POST /abortabortabort
- 5. Send SIGTERM (`kill`)
- 6. Send SIGKILL (`kill -9`)
-
-If the Executor notices that all Processes in a Task have aborted during this sequence, it will
-not proceed with subsequent steps. Note that graceful shutdown is best-effort, and due to the many
-inevitable realities of distributed systems, it may not be performed.
-
-### Giving Priority to Production Tasks: PREEMPTING
-
-Sometimes a Task needs to be interrupted, such as when a non-production
-Task's resources are needed by a higher priority production Task. This
-type of interruption is called a *pre-emption*. When this happens in
-Aurora, the non-production Task is killed and moved into
-the `PREEMPTING` state when both the following are true:
-
-- The task being killed is a non-production task.
-- The other task is a `PENDING` production task that hasn't been
-  scheduled due to a lack of resources.
-
-Since production tasks are much more important, Aurora kills off the
-non-production task to free up resources for the production task. The
-scheduler UI shows the non-production task was preempted in favor of the
-production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
-
-Note that non-production tasks consuming many resources are likely to be
-preempted in favor of production tasks.
-
-### Natural Termination: FINISHED, FAILED
-
-A `RUNNING` `Task` can terminate without direct user interaction. For
-example, it may be a finite computation that finishes, even something as
-simple as `echo hello world. `Or it could be an exceptional condition in
-a long-lived service. If the `Task` is successful (its underlying
-processes have succeeded with exit status `0` or finished without
-reaching failure limits) it moves into `FINISHED` state. If it finished
-after reaching a set of failure limits, it goes into `FAILED` state.
-
-### Forceful Termination: KILLING, RESTARTING
-
-You can terminate a `Task` by issuing an `aurora job kill` command, which
-moves it into `KILLING` state. The scheduler then sends the slave a
-request to terminate the `Task`. If the scheduler receives a successful
-response, it moves the Task into `KILLED` state and never restarts it.
-
-The scheduler has access to a non-public `RESTARTING` state. If a `Task`
-is forced into the `RESTARTING` state, the scheduler kills the
-underlying task but in parallel schedules an identical replacement for
-it.

 Configuration
 -------------
