Added: aurora/site/source/documentation/0.18.0/reference/task-lifecycle.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/reference/task-lifecycle.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/reference/task-lifecycle.md (added)
+++ aurora/site/source/documentation/0.18.0/reference/task-lifecycle.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,148 @@
+# Task Lifecycle
+
+When Aurora reads a configuration file and finds a `Job` definition, it:
+
+1. Evaluates the `Job` definition.
+2. Splits the `Job` into its constituent `Task`s.
+3. Sends those `Task`s to the scheduler.
+4. The scheduler puts the `Task`s into `PENDING` state, starting each
+   `Task`'s lifecycle.
+
+
+Please note that a couple of task states described below are missing from
+this state diagram.
+
+
+## PENDING to RUNNING states
+
+When a `Task` is in the `PENDING` state, the scheduler constantly
+searches for machines satisfying that `Task`'s resource request
+requirements (RAM, disk space, CPU time) while maintaining configuration
+constraints such as "a `Task` must run on machines dedicated to a
+particular role" or attribute limit constraints such as "at most 2
+`Task`s from the same `Job` may run on each rack". When the scheduler
+finds a suitable match, it assigns the `Task` to a machine and puts the
+`Task` into the `ASSIGNED` state.
+
+From the `ASSIGNED` state, the scheduler sends an RPC to the agent
+machine containing the `Task` configuration, which the agent uses to spawn
+an executor responsible for the `Task`'s lifecycle. When the scheduler
+receives an acknowledgment that the machine has accepted the `Task`,
+the `Task` goes into the `STARTING` state.
+
+The `STARTING` state initializes a `Task` sandbox. When the sandbox is fully
+initialized, Thermos begins to invoke `Process`es.
The agent machine
+also sends an update to the scheduler that the `Task` is in the `RUNNING`
+state, but only after the task satisfies its liveness requirements.
+See [Health Checking](../features/services#health-checking) for more details
+on how to configure health checks.
+
+
+
+## RUNNING to terminal states
+
+There are various ways that an active `Task` can transition into a terminal
+state. By definition, it can never leave that state again. However, depending
+on the nature of the termination and the originating `Job` definition
+(e.g. `service`, `max_task_failures`), a replacement `Task` might be
+scheduled.
+
+### Natural Termination: FINISHED, FAILED
+
+A `RUNNING` `Task` can terminate without direct user interaction. For
+example, it may be a finite computation that finishes, even something as
+simple as `echo hello world.`, or it could be an exceptional condition in
+a long-lived service. If the `Task` is successful (its underlying
+processes have succeeded with exit status `0` or finished without
+reaching failure limits) it moves into the `FINISHED` state. If it finished
+after reaching a set of failure limits, it goes into the `FAILED` state.
+
+A terminated `Task` which is subject to rescheduling will be temporarily
+`THROTTLED` if it is considered to be flapping. A task is considered
+flapping if its previous invocation was terminated after less than 5 minutes
+(scheduler default). The time penalty a task has to remain in the `THROTTLED`
+state before it is eligible for rescheduling increases with each consecutive
+failure.
+
+### Forceful Termination: KILLING, RESTARTING
+
+You can terminate a `Task` by issuing an `aurora job kill` command, which
+moves it into the `KILLING` state. The scheduler then sends the agent a
+request to terminate the `Task`. If the scheduler receives a successful
+response, it moves the `Task` into the `KILLED` state and never restarts it.
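The transitions covered in this section can be summarized in a toy state table. This is a simplified sketch for illustration only, not the scheduler's actual state machine: some states (such as `THROTTLED`) are omitted, and the real implementation has additional states and transition guards. Several of the states listed (`RESTARTING`, `LOST`, `PREEMPTING`, `DRAINING`) are detailed in the subsections that follow.

```python
# Simplified sketch of the task state machine described in this document.
# Illustrative only: THROTTLED is omitted, and the real scheduler has
# additional states and transition guards.

TRANSITIONS = {
    'PENDING':    {'ASSIGNED'},
    'ASSIGNED':   {'STARTING', 'LOST'},
    'STARTING':   {'RUNNING', 'LOST'},
    'RUNNING':    {'FINISHED', 'FAILED', 'KILLING', 'RESTARTING',
                   'LOST', 'PREEMPTING', 'DRAINING'},
    'KILLING':    {'KILLED'},
    'RESTARTING': {'KILLED'},
    'PREEMPTING': {'KILLED'},
    'DRAINING':   {'KILLED'},
}

# Terminal states have no successors: by definition, a task never leaves them.
TERMINAL = {'FINISHED', 'FAILED', 'KILLED', 'LOST'}

def can_transition(src, dst):
    """True if `dst` is a legal successor of `src` in this sketch."""
    return dst in TRANSITIONS.get(src, set())

print(can_transition('RUNNING', 'KILLING'))   # True
print(can_transition('FINISHED', 'RUNNING'))  # False: terminal states are final
```

Note that every non-terminal path eventually ends in one of the four terminal states; whether a replacement `Task` is then scheduled depends on the `Job` definition, as described above.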
+
+If a `Task` is forced into the `RESTARTING` state via the `aurora job restart`
+command, the scheduler kills the underlying task but in parallel schedules
+an identical replacement for it.
+
+In any case, the responsible executor on the agent follows an escalation
+sequence when killing a running task:
+
+ 1. If a `HttpLifecycleConfig` is not present, skip to (4).
+ 2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds.
+ 3. Send a POST to the `shutdown_endpoint` and wait 5 seconds.
+ 4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds.
+ 5. Send SIGKILL (`kill -9`).
+
+If the executor notices that all `Process`es in a `Task` have aborted
+during this sequence, it will not proceed with subsequent steps.
+Note that graceful shutdown is best-effort; due to the many
+inevitable realities of distributed systems, it may not be performed.
+
+### Unexpected Termination: LOST
+
+If a `Task` stays in a transient task state for too long (such as `ASSIGNED`
+or `STARTING`), the scheduler forces it into the `LOST` state, creating a new
+`Task` in its place that's sent into the `PENDING` state.
+
+In addition, if the Mesos core tells the scheduler that an agent has
+become unhealthy (or outright disappeared), the `Task`s assigned to that
+agent go into the `LOST` state and new `Task`s are created in their place.
+From the `PENDING` state, there is no guarantee a `Task` will be reassigned
+to the same machine unless job constraints explicitly force it there.
+
+### Giving Priority to Production Tasks: PREEMPTING
+
+Sometimes a `Task` needs to be interrupted, such as when a non-production
+`Task`'s resources are needed by a higher priority production `Task`. This
+type of interruption is called a *pre-emption*. When this happens in
+Aurora, the non-production `Task` is killed and moved into
+the `PREEMPTING` state when both of the following are true:
+
+- The task being killed is a non-production task.
+- The other task is a `PENDING` production task that hasn't been
+  scheduled due to a lack of resources.
+
+The scheduler UI shows that the non-production task was preempted in favor of
+the production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
+
+Note that non-production tasks consuming many resources are likely to be
+preempted in favor of production tasks.
+
+### Making Room for Maintenance: DRAINING
+
+Cluster operators can set an agent into maintenance mode. This transitions
+all `Task`s running on that agent into `DRAINING` and eventually to `KILLED`.
+Drained `Task`s will be restarted on other agents for which no maintenance
+has been announced yet.
+
+
+
+## State Reconciliation
+
+Due to the many inevitable realities of distributed systems, there might
+be a mismatch between perceived and actual cluster state (e.g. a machine
+returns from a network partition but the scheduler has already marked all
+its `Task`s as `LOST` and rescheduled them).
+
+Aurora regularly runs a state reconciliation process in order to detect
+and correct such issues (e.g. by killing the errant `RUNNING` tasks).
+By default, the proper detection of all failure scenarios and inconsistencies
+may take up to an hour.
+
+To emphasize this point: there is no uniqueness guarantee for a single
+instance of a job in the presence of network partitions. If the `Task`
+requires that, it should be baked in at the application level using a
+distributed coordination service such as ZooKeeper.
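The reconciliation idea above can be illustrated with a toy model. This is not Aurora's implementation — just a sketch of the core comparison: the scheduler's perceived task states versus what the agents actually report, from which corrective actions (kill the errant task, reschedule the lost one) follow.

```python
# Toy illustration of state reconciliation (not Aurora's implementation):
# compare the scheduler's perceived task states against what the agents
# actually report, then derive corrective actions.

def reconcile(perceived, actual):
    """Both arguments map task_id -> state; returns (to_kill, to_reschedule).

    - A task the scheduler already wrote off as LOST, but that an agent
      still reports as RUNNING, is errant and must be killed.
    - A task perceived as RUNNING that no agent reports any more is lost
      and needs a replacement.
    """
    to_kill = [t for t, s in actual.items()
               if s == 'RUNNING' and perceived.get(t) == 'LOST']
    to_reschedule = [t for t, s in perceived.items()
                     if s == 'RUNNING' and t not in actual]
    return to_kill, to_reschedule

perceived = {'task-1': 'LOST', 'task-2': 'RUNNING'}
actual = {'task-1': 'RUNNING'}  # task-1 came back after a network partition
print(reconcile(perceived, actual))  # (['task-1'], ['task-2'])
```

The example also shows why there is no uniqueness guarantee during a partition: between the partition healing and reconciliation running, both `task-1` and its already-scheduled replacement may be running simultaneously.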
Added: aurora/site/source/documentation/latest/operations/troubleshooting.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/troubleshooting.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/latest/operations/troubleshooting.md (added)
+++ aurora/site/source/documentation/latest/operations/troubleshooting.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,106 @@
+# Troubleshooting
+
+So you've started your first cluster and are running into some issues? We've collected some common
+stumbling blocks and solutions here to help get you moving.
+
+## Replicated log not initialized
+
+### Symptoms
+- Scheduler RPCs and web interface claim `Storage is not READY`
+- Scheduler log repeatedly prints messages like
+
+  ```
+  I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status
+  received a broadcasted recover request
+  I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response
+  from a replica in EMPTY status
+  ```
+
+### Solution
+When you create a new cluster, you need to inform a quorum of schedulers that they are safe to
+consider their database to be empty by [initializing](../installation/#finalizing) the
+replicated log. This is done to prevent the scheduler from modifying the cluster state in the event
+of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path.
+
+
+## No distinct leader elected
+
+### Symptoms
+Either no scheduler believes it is leading, or multiple schedulers believe they are.
+
+### Solution
+Verify that the [network configuration](../configuration/#network-configuration) of the Aurora
+scheduler is correct:
+
+* The `LIBPROCESS_IP:LIBPROCESS_PORT` endpoints must be reachable from all coordinator nodes running
+  a scheduler or a Mesos master.
+* Hostname lookups have to resolve to public IPs rather than local ones that cannot be reached
+  from another node.
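The hostname requirement above can be spot-checked from a shell on each node. The helper below is hypothetical (not part of Aurora) and uses only the Python standard library; it merely resolves a hostname and flags loopback results, which would produce endpoints other nodes cannot reach:

```python
import socket

def resolves_publicly(hostname):
    """Return True if `hostname` resolves to a non-loopback IPv4 address.

    A scheduler whose hostname maps into 127.0.0.0/8 will advertise an
    endpoint that is unreachable from other nodes.
    """
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False  # name does not resolve at all
    return not ip.startswith('127.')

# 'localhost' maps to a loopback address, so it fails the check;
# run this against the machine's actual hostname on each node.
print(resolves_publicly('localhost'))  # False
```

Running this with `socket.gethostname()` on every scheduler and master node is a quick way to catch `/etc/hosts` entries that pin the hostname to `127.0.0.1`.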
+
+In addition, double-check the [quota settings](../configuration/#replicated-log-configuration) of the
+replicated log.
+
+
+## Scheduler not registered
+
+### Symptoms
+Scheduler log contains
+
+    Framework has not been registered within the tolerated delay.
+
+### Solution
+Double-check that the scheduler is configured correctly to reach the Mesos master. If you are registering
+the master in ZooKeeper, make sure the command line argument given to the master:
+
+    --zk=zk://$ZK_HOST:2181/mesos/master
+
+is the same as the one given to the scheduler:
+
+    -mesos_master_address=zk://$ZK_HOST:2181/mesos/master
+
+
+## Scheduler not running
+
+### Symptoms
+The scheduler process commits suicide regularly. This happens under error conditions, but
+also on purpose at regular intervals.
+
+### Solution
+Aurora is meant to be run under supervision. You have to configure a supervisor like
+[Monit](http://mmonit.com/monit/), [supervisord](http://supervisord.org/), or systemd to run the
+scheduler and restart it whenever it fails or exits on purpose.
+
+Aurora supports an active health checking protocol on its admin HTTP interface - if a `GET /health`
+times out or returns anything other than `200 OK`, the scheduler process is unhealthy and should be
+restarted.
+
+For example, monit can be configured with
+
+    if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart
+
+assuming you set `-http_port=8081`.
+
+
+## Executor crashing or hanging
+
+### Symptoms
+Launched task instances never transition to `STARTING` or `RUNNING` but immediately transition
+to `FAILED` or `LOST`.
+
+### Solution
+The executor might be failing due to unknown internal errors such as a missing native dependency
+of the Mesos executor library. Open the Mesos UI and navigate to the failing
+task in question. Inspect the various log files in order to learn what is going on.
+
+
+## Observer does not discover tasks
+
+### Symptoms
+The observer UI does not list any tasks. When navigating from the scheduler UI to the state of
+a particular task instance, the observer returns `Error: 404 Not Found`.
+
+### Solution
+The observer refreshes its internal state every couple of seconds. If waiting a few seconds
+does not resolve the issue, check that the `--mesos-root` setting of the observer and the
+`--work_dir` option of the Mesos agent are in sync. For details, see our
+[Install instructions](../installation/#worker-configuration).

Added: aurora/site/source/documentation/latest/operations/upgrades.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/upgrades.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/latest/operations/upgrades.md (added)
+++ aurora/site/source/documentation/latest/operations/upgrades.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,41 @@
+# Upgrading Aurora
+
+Aurora can be updated from one version to the next without any downtime or restarts of running
+jobs. The same holds true for Mesos.
+
+Generally speaking, Mesos and Aurora strive for +1/-1 version compatibility, i.e. all components
+are meant to be forwards and backwards compatible for at least one version. This implies it
+does not really matter in which order updates are carried out.
+
+Exceptions to this rule are documented in the [Aurora release notes](../../../RELEASE-NOTES/)
+and the [Mesos upgrade instructions](https://mesos.apache.org/documentation/latest/upgrades/).
+
+
+## Instructions
+
+To upgrade Aurora, follow these steps:
+
+1. Update the first scheduler instance by updating its software and restarting its process.
+2. Wait until the scheduler is up and its [Replicated Log](../configuration/#replicated-log-configuration)
+   has caught up with the other schedulers in the cluster.
The log has caught up if `log/recovered` has
+   the value `1`. You can check the metric via `curl LIBPROCESS_IP:LIBPROCESS_PORT/metrics/snapshot`,
+   where IP and port refer to the [libmesos configuration](../configuration/#network-configuration)
+   settings of the scheduler instance.
+3. Proceed with the next scheduler until all instances are updated.
+4. Update the Aurora executor deployed to the compute nodes of your cluster. Jobs will continue
+   running with the old version of the executor, and will only be launched by the new one once
+   they are eventually restarted due to natural cluster churn.
+5. Distribute the new Aurora client to your users.
+
+
+## Best Practices
+
+Even though it is not absolutely mandatory, we advise adhering to the following rules:
+
+* Never skip any major or minor releases when updating. If you have to catch up on several
+  releases, you have to deploy all intermediary versions. Skipping bugfix releases is
+  acceptable though.
+* Verify all updates on a test cluster before touching your production deployments.
+* To minimize the number of failovers during updates, update the currently leading scheduler
+  instance last.
+* Update the Aurora executor on a subset of compute nodes as a canary before deploying the change to
+  the whole fleet.

Added: aurora/site/source/documentation/latest/reference/observer-configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/reference/observer-configuration.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/latest/reference/observer-configuration.md (added)
+++ aurora/site/source/documentation/latest/reference/observer-configuration.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,89 @@
+# Observer Configuration Reference
+
+The Aurora/Thermos observer can take a variety of configuration options through command-line arguments.
+A list of the available options can be seen by running `thermos_observer --long-help`. + +Please refer to the [Operator Configuration Guide](../../operations/configuration/) for details on how +to properly set the most important options. + +``` +$ thermos_observer.pex --long-help +Options: + -h, --help, --short-help + show this help message and exit. + --long-help show options from all registered modules, not just the + __main__ module. + --mesos-root=MESOS_ROOT + The mesos root directory to search for Thermos + executor sandboxes [default: /var/lib/mesos] + --ip=IP The IP address the observer will bind to. [default: + 0.0.0.0] + --port=PORT The port on which the observer should listen. + [default: 1338] + --polling_interval_secs=POLLING_INTERVAL_SECS + The number of seconds between observer refresh + attempts. [default: 5] + --task_process_collection_interval_secs=TASK_PROCESS_COLLECTION_INTERVAL_SECS + The number of seconds between per task process + resource collections. [default: 20] + --task_disk_collection_interval_secs=TASK_DISK_COLLECTION_INTERVAL_SECS + The number of seconds between per task disk resource + collections. [default: 60] + + From module twitter.common.app: + --app_daemonize Daemonize this application. [default: False] + --app_profile_output=FILENAME + Dump the profiling output to a binary profiling + format. [default: None] + --app_daemon_stderr=TWITTER_COMMON_APP_DAEMON_STDERR + Direct this app's stderr to this file if daemonized. + [default: /dev/null] + --app_debug Print extra debugging information during application + initialization. [default: False] + --app_rc_filename Print the filename for the rc file and quit. [default: + False] + --app_daemon_stdout=TWITTER_COMMON_APP_DAEMON_STDOUT + Direct this app's stdout to this file if daemonized. + [default: /dev/null] + --app_profiling Run profiler on the code while it runs. Note this can + cause slowdowns. [default: False] + --app_ignore_rc_file + Ignore default arguments from the rc file. 
[default: + False] + --app_pidfile=TWITTER_COMMON_APP_PIDFILE + The pidfile to use if --app_daemonize is specified. + [default: None] + + From module twitter.common.log.options: + --log_to_stdout=[scheme:]LEVEL + OBSOLETE - legacy flag, use --log_to_stderr instead. + [default: ERROR] + --log_to_stderr=[scheme:]LEVEL + The level at which logging to stderr [default: ERROR]. + Takes either LEVEL or scheme:LEVEL, where LEVEL is one + of ['INFO', 'NONE', 'WARN', 'ERROR', 'DEBUG', 'FATAL'] + and scheme is one of ['google', 'plain']. + --log_to_disk=[scheme:]LEVEL + The level at which logging to disk [default: INFO]. + Takes either LEVEL or scheme:LEVEL, where LEVEL is one + of ['INFO', 'NONE', 'WARN', 'ERROR', 'DEBUG', 'FATAL'] + and scheme is one of ['google', 'plain']. + --log_dir=DIR The directory into which log files will be generated + [default: /var/tmp]. + --log_simple Write a single log file rather than one log file per + log level [default: False]. + --log_to_scribe=[scheme:]LEVEL + The level at which logging to scribe [default: NONE]. + Takes either LEVEL or scheme:LEVEL, where LEVEL is one + of ['INFO', 'NONE', 'WARN', 'ERROR', 'DEBUG', 'FATAL'] + and scheme is one of ['google', 'plain']. + --scribe_category=CATEGORY + The category used when logging to the scribe daemon. + [default: python_default]. + --scribe_buffer Buffer messages when scribe is unavailable rather than + dropping them. [default: False]. + --scribe_host=HOST The host running the scribe daemon. [default: + localhost]. + --scribe_port=PORT The port used to connect to the scribe daemon. + [default: 1463]. +```
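The `--ip` and `--port` options above (default `0.0.0.0:1338`) determine where the observer's web UI listens. A quick, generic way to confirm that an observer instance is actually accepting connections is a plain TCP probe; the helper below is a hypothetical illustration using only the Python standard library and knows nothing observer-specific:

```python
import socket

def observer_reachable(host='127.0.0.1', port=1338, timeout=2):
    """Generic TCP reachability probe for the observer's --ip/--port
    (default port 1338); nothing here is specific to the observer."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, or timed out
```

If the port is reachable but the UI still shows no tasks, revisit the `--mesos-root` / `--work_dir` mismatch described in the troubleshooting guide above.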
