Added: aurora/site/source/documentation/0.12.0/client-cluster-configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/client-cluster-configuration.md?rev=1733548&view=auto
==============================================================================
--- aurora/site/source/documentation/0.12.0/client-cluster-configuration.md (added)
+++ aurora/site/source/documentation/0.12.0/client-cluster-configuration.md Fri Mar 4 02:43:01 2016
@@ -0,0 +1,70 @@
# Client Cluster Configuration

A cluster configuration file is used by the Aurora client to describe the Aurora clusters with
which it can communicate. Ultimately this allows client users to reference clusters with short names
like `us-east` and `eu`. The following properties may be set:

 **Property**             | **Type** | **Description**
 :------------------------| :------- | :--------------
 **name**                 | String   | Cluster name (Required)
 **slave_root**           | String   | Path to mesos slave work dir (Required)
 **slave_run_directory**  | String   | Name of mesos slave run dir (Required)
 **zk**                   | String   | Hostname of ZooKeeper instance used to resolve Aurora schedulers.
 **zk_port**              | Integer  | Port of ZooKeeper instance used to locate Aurora schedulers (Default: 2181)
 **scheduler_zk_path**    | String   | ZooKeeper path under which scheduler instances are registered.
 **scheduler_uri**        | String   | URI of Aurora scheduler instance.
 **proxy_url**            | String   | Used by the client to format URLs for display.
 **auth_mechanism**       | String   | The authentication mechanism to use when communicating with the scheduler. (Default: UNAUTHENTICATED)

#### name

The name of the Aurora cluster represented by this entry. This name will be the `cluster` portion of
any job keys identifying jobs running within the cluster.

#### slave_root

The path on the mesos slaves where executing tasks can be found. It is used in combination with the
`slave_run_directory` property by `aurora task run` and `aurora task ssh` to change into the sandbox
directory after connecting to the host. This value should match the value passed to `mesos-slave`
as `-work_dir`.

#### slave_run_directory

The name of the directory where the task run can be found. This is used in combination with the
`slave_root` property by `aurora task run` and `aurora task ssh` to change into the sandbox
directory after connecting to the host. This should almost always be set to `latest`.

#### zk

The hostname of the ZooKeeper instance used to resolve the Aurora scheduler. Aurora uses ZooKeeper
to elect a leader. The client will connect to this ZooKeeper instance to determine the current
leader. This host should match the host passed to the scheduler as `-zk_endpoints`.

#### zk_port

The port on which the ZooKeeper instance is running. If not set this will default to the standard
ZooKeeper port of 2181. This port should match the port in the host passed to the scheduler as
`-zk_endpoints`.

#### scheduler_zk_path

The path on the ZooKeeper instance under which the Aurora serverset is registered. This value should
match the value passed to the scheduler as `-serverset_path`.

#### scheduler_uri

The URI of the scheduler. This would be used in place of the ZooKeeper related configuration above
in circumstances where direct communication with a single scheduler is needed (e.g. testing
environments). It is strongly advised to **never** use this property for production deploys.
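
For testing against a single scheduler, an entry can be as small as a name plus `scheduler_uri`.
A minimal sketch (the address below is illustrative, mirroring the example in the client commands
documentation):

```javascript
[{
  "name": "example",
  "scheduler_uri": "localhost:55555"
}]
```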

#### proxy_url

If `proxy_url` is set, its value will be used as the base URL instead of the hostname of the
leading scheduler. In that scenario the value for `proxy_url` would be, for example, the URL of
your VIP in a load balancer or a round-robin DNS name.

#### auth_mechanism

The identifier of an authentication mechanism that the client should use when communicating with the
scheduler. Support for values other than `UNAUTHENTICATED` requires a matching scheduler-side
[security configuration](/documentation/0.12.0/security/).
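
Putting the properties above together, a complete entry for a leader-elected production cluster
might look like the following sketch. All hostnames and paths here are illustrative; the `zk`,
`zk_port`, and `scheduler_zk_path` values must match the corresponding scheduler flags as described
above:

```javascript
[{
  "name": "us-east",
  "slave_root": "/var/lib/mesos",
  "slave_run_directory": "latest",
  "zk": "zk1.us-east.example.com",
  "zk_port": 2181,
  "scheduler_zk_path": "/aurora/scheduler",
  "proxy_url": "http://aurora.us-east.example.com:8081"
}]
```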
Added: aurora/site/source/documentation/0.12.0/client-commands.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/client-commands.md?rev=1733548&view=auto
==============================================================================
--- aurora/site/source/documentation/0.12.0/client-commands.md (added)
+++ aurora/site/source/documentation/0.12.0/client-commands.md Fri Mar 4 02:43:01 2016
@@ -0,0 +1,379 @@
Aurora Client Commands
======================

- [Introduction](#introduction)
- [Cluster Configuration](#cluster-configuration)
- [Job Keys](#job-keys)
- [Modifying Aurora Client Commands](#modifying-aurora-client-commands)
- [Regular Jobs](#regular-jobs)
  - [Creating and Running a Job](#creating-and-running-a-job)
  - [Running a Command On a Running Job](#running-a-command-on-a-running-job)
  - [Killing a Job](#killing-a-job)
  - [Updating a Job](#updating-a-job)
    - [Coordinated job updates](#coordinated-job-updates)
  - [Renaming a Job](#renaming-a-job)
  - [Restarting Jobs](#restarting-jobs)
- [Cron Jobs](#cron-jobs)
- [Comparing Jobs](#comparing-jobs)
- [Viewing/Examining Jobs](#viewingexamining-jobs)
  - [Listing Jobs](#listing-jobs)
  - [Inspecting a Job](#inspecting-a-job)
  - [Checking Your Quota](#checking-your-quota)
  - [Finding a Job on Web UI](#finding-a-job-on-web-ui)
  - [Getting Job Status](#getting-job-status)
  - [Opening the Web UI](#opening-the-web-ui)
  - [SSHing to a Specific Task Machine](#sshing-to-a-specific-task-machine)
  - [Templating Command Arguments](#templating-command-arguments)

Introduction
------------

Once you have written an `.aurora` configuration file that describes
your Job and its parameters and functionality, you interact with Aurora
using Aurora Client commands. This document describes all of these commands
and how and when to use them. All Aurora Client commands start with
`aurora`, followed by the name of the specific command and its
arguments.

*Job keys* are a very common argument to Aurora commands, as well as the
gateway to useful information about a Job. Before using Aurora, you
should read the next section which describes them in detail. The section
after that briefly describes how you can modify the behavior of certain
Aurora Client commands, linking to a detailed document about how to do
that.

This is followed by the Regular Jobs section, which describes the basic
Client commands for creating, running, and manipulating Aurora Jobs.
After that are sections on Comparing Jobs and Viewing/Examining Jobs,
which cover the various commands for getting information and metadata
about Aurora Jobs.

Cluster Configuration
---------------------

The client must be able to find a configuration file that specifies available clusters. This file
declares shorthand names for clusters, which are in turn referenced by job configuration files
and client commands.

The client will load at most two configuration files, making both of their defined clusters
available. The first is intended to be a system-installed file, using the path specified in
the environment variable `AURORA_CONFIG_ROOT`, defaulting to `/etc/aurora/clusters.json` if the
environment variable is not set. The second is a user-installed file, located at
`~/.aurora/clusters.json`.

A cluster configuration is formatted as JSON. The simplest cluster configuration is one that
communicates with a single (non-leader-elected) scheduler.
For example:

```javascript
[{
  "name": "example",
  "scheduler_uri": "localhost:55555"
}]
```

A configuration for a leader-elected scheduler would contain something like:

```javascript
[{
  "name": "example",
  "zk": "192.168.33.7",
  "scheduler_zk_path": "/aurora/scheduler"
}]
```

For more details on cluster configuration see the
[Client Cluster Configuration](/documentation/0.12.0/client-cluster-configuration/) documentation.

Job Keys
--------

A job key is a unique system-wide identifier for an Aurora-managed
Job, for example `cluster1/web-team/test/experiment204`. It is a 4-tuple
consisting of, in order, *cluster*, *role*, *environment*, and
*jobname*, separated by slashes. Cluster is the name of an Aurora
cluster. Role is the Unix service account under which the Job
runs. Environment is a namespace component like `devel`, `test`,
`prod`, or `stagingN`. Jobname is the Job's name.

The combination of all four values uniquely specifies the Job. If any
one value is different from that of another job key, the two job keys
refer to different Jobs. For example, job key
`cluster1/tyg/prod/workhorse` is different from
`cluster1/tyg/prod/workcamel`, which is different from
`cluster2/tyg/prod/workhorse`, which is different from
`cluster2/foo/prod/workhorse`, which is different from
`cluster1/tyg/test/workhorse`.

Role names are user accounts existing on the slave machines. If you don't know what accounts
are available, contact your sysadmin.

Environment names are namespaces; you can count on `prod`, `devel` and `test` existing.

Modifying Aurora Client Commands
--------------------------------

For certain Aurora Client commands, you can define hook methods that run
either before or after an action that takes place during the command's
execution, as well as based on whether the action finished successfully or failed
during execution. Basically, a hook is code that lets you extend the
command's actions. The hook executes on the client side, specifically on
the machine executing Aurora commands.

Hooks can be associated with these Aurora Client commands:

  - `job create`
  - `job kill`
  - `job restart`

The process for writing and activating them is complex enough
that we explain it in a devoted document, [Hooks for Aurora Client API](/documentation/0.12.0/hooks/).

Regular Jobs
------------

This section covers Aurora commands related to running, killing,
renaming, updating, and restarting a basic Aurora Job.

### Creating and Running a Job

    aurora job create <job key> <configuration file>

Creates and then runs a Job with the specified job key based on a `.aurora` configuration file.
The configuration file may also contain and activate hook definitions.

### Running a Command On a Running Job

    aurora task run CLUSTER/ROLE/ENV/NAME[/INSTANCES] <cmd>

Runs a shell command on all machines currently hosting shards of a
single Job.

`run` supports the same command line wildcards used to populate a Job's
commands; i.e. anything in the `{{mesos.*}}` and `{{thermos.*}}`
namespaces.

### Killing a Job

    aurora job killall CLUSTER/ROLE/ENV/NAME

Kills all Tasks associated with the specified Job, blocking until all
are terminated. Defaults to killing all instances in the Job.

The `<configuration file>` argument for `kill` is optional. Use it only
if it contains hook definitions and activations that affect the
kill command.
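
As a usage sketch, a create / remote-command / kill sequence against the Vagrant devcluster (job
key and configuration path borrowed from the scheduler UI example later in this document; the shell
command is illustrative):

    aurora job create devcluster/www-data/prod/hello /vagrant/examples/jobs/hello_world.aurora
    aurora task run devcluster/www-data/prod/hello 'ps -ef | grep -i hello'
    aurora job killall devcluster/www-data/prod/hello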

### Updating a Job

There are several sub-commands to manage job updates:

    aurora update start <job key> <configuration file>
    aurora update info <job key>
    aurora update pause <job key>
    aurora update resume <job key>
    aurora update abort <job key>
    aurora update list <cluster>

When you `start` a job update, the command will return once it has sent the
instructions to the scheduler. At that point, you may view detailed
progress for the update with the `info` subcommand, in addition to viewing
graphical progress in the web browser. You may also get a full listing of
in-progress updates in a cluster with `list`.

Once an update has been started, you can `pause` to keep the update but halt
progress. This can be useful for things like debugging a partially-updated
job to determine whether you would like to proceed. You can `resume` to
proceed.

You may `abort` a job update regardless of the state it is in. This will
instruct the scheduler to completely abandon the job update and leave the job
in the current (possibly partially-updated) state.

#### Coordinated job updates

Some Aurora services may benefit from having more control over updates by explicitly
acknowledging ("heartbeating") job update progress. This may be helpful for mission-critical
service updates where explicit job health monitoring is vital during the entire job update
lifecycle. Such job updates would rely on an external service (or a custom client) periodically
pulsing an active coordinated job update via a
[pulseJobUpdate RPC](https://github.com/apache/aurora/blob/#{git_tag}/api/src/main/thrift/org/apache/aurora/gen/api.thrift).

A coordinated update is defined by setting a positive
[pulse_interval_secs](/documentation/0.12.0/configuration-reference/#updateconfig-objects) value in the job
configuration file. If no pulses are received within the specified interval, the update will be
blocked. A blocked update is unable to continue rolling forward (or rolling back) but retains its
active status. It may only be unblocked by a fresh `pulseJobUpdate` call.

NOTE: A coordinated update starts in `ROLL_FORWARD_AWAITING_PULSE` state and will not make any
progress until the first pulse arrives. However, a paused update (`ROLL_FORWARD_PAUSED` or
`ROLL_BACK_PAUSED`) is still considered active and upon resuming will immediately make progress
provided the pulse interval has not expired. A sample `UpdateConfig` enabling coordination is
sketched at the end of this section.

### Renaming a Job

Renaming is a tricky operation as downstream clients must be informed of
the new name. A conservative approach
to renaming suitable for production services is:

1. Modify the Aurora configuration file to change the role,
   environment, and/or name as appropriate to the standardized naming
   scheme.
2. Check that only these naming components have changed
   with `aurora job diff`.

        aurora job diff CLUSTER/ROLE/ENV/NAME <job_configuration>

3. Create the (identical) job at the new key. You may need to request a
   temporary quota increase.

        aurora job create CLUSTER/ROLE/ENV/NEW_NAME <job_configuration>

4. Migrate all clients over to the new job key. Update all links and
   dashboards. Ensure that both job keys run identical versions of the
   code while in this state.
5. After verifying that all clients have successfully moved over, kill
   the old job.

        aurora job killall CLUSTER/ROLE/ENV/NAME

6. If you received a temporary quota increase, be sure to let the
   powers that be know you no longer need the additional capacity.
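
As referenced under Coordinated job updates above, here is a minimal sketch of an `UpdateConfig`
that turns on coordination. The interval value is illustrative; `batch_size` and `watch_secs` are
shown at their documented defaults (see the
[configuration reference](/documentation/0.12.0/configuration-reference/#updateconfig-objects)):

    update_config = UpdateConfig(
      batch_size = 1,           # default: update one shard per iteration
      watch_secs = 45,          # default: shard must stay RUNNING this long
      pulse_interval_secs = 60  # any positive value makes the update coordinated
    )
    # Bind via the Job's update_config attribute. The external coordinator must
    # then call pulseJobUpdate at least once every 60 seconds, or the update
    # becomes blocked until the next pulse arrives.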

### Restarting Jobs

`restart` restarts all shards of the Job identified by a job key:

    aurora job restart CLUSTER/ROLE/ENV/NAME[/INSTANCES]

Restarts are controlled on the client side, so aborting
the `job restart` command halts the restart operation.

**Note**: `job restart` only applies its command line arguments and neither
uses nor is affected by `update.config`. Restarting
does ***not*** involve a configuration change. To update the
configuration, use `update.config`.

The `--config` argument for restart is optional. Use it only
if it contains hook definitions and activations that affect the
`job restart` command.

Cron Jobs
---------

You can manage cron jobs using the `aurora cron` command. Please see
[cron-jobs.md](/documentation/0.12.0/cron-jobs/) for more details.

You will see various commands and options relating to cron jobs in
`aurora -h` and similar. Ignore them, as they're not yet implemented.

Comparing Jobs
--------------

    aurora job diff CLUSTER/ROLE/ENV/NAME <job configuration>

Compares a job configuration against a running job. By default the diff
is determined using `diff`, though you may choose an alternate
diff program by specifying the `DIFF_VIEWER` environment variable.

Viewing/Examining Jobs
----------------------

Above we discussed creating, killing, and updating Jobs. Here we discuss
how to view and examine Jobs.

### Listing Jobs

    aurora config list <job configuration>

Lists all Jobs registered with the Aurora scheduler in the named cluster for the named role.

### Inspecting a Job

    aurora job inspect CLUSTER/ROLE/ENV/NAME <job configuration>

`inspect` verifies that its specified job can be parsed from a
configuration file, and displays the parsed configuration.

### Checking Your Quota

    aurora quota get CLUSTER/ROLE

Prints the production quota allocated to the role at the given
cluster. Only non-[dedicated](/documentation/0.12.0/deploying-aurora-scheduler/#dedicated-attribute)
[production](/documentation/0.12.0/configuration-reference/#job-objects) jobs consume quota.

### Finding a Job on Web UI

When you create a job, part of the output response contains a URL that goes
to the job's scheduler UI page. For example:

    vagrant@precise64:~$ aurora job create devcluster/www-data/prod/hello /vagrant/examples/jobs/hello_world.aurora
    INFO] Creating job hello
    INFO] Response from scheduler: OK (message: 1 new tasks pending for job www-data/prod/hello)
    INFO] Job url: http://precise64:8081/scheduler/www-data/prod/hello

You can go to the scheduler UI page for this job via `http://precise64:8081/scheduler/www-data/prod/hello`.
You can go to the overall scheduler UI page by going to the part of that URL that ends at
`scheduler`: `http://precise64:8081/scheduler`

Once you click through to a role page, you see Jobs arranged
separately by pending jobs, active jobs and finished jobs.
Jobs are arranged by role, typically a service account for
production jobs and user accounts for test or development jobs.

### Getting Job Status

    aurora job status <job_key>

Returns the status of recent tasks associated with the Job specified by
`job_key` in its supplied cluster. Typically this includes
a mix of active tasks (running or assigned) and inactive tasks
(successful, failed, and lost.)

### Opening the Web UI

Use the Job's web UI scheduler URL or the `aurora job status` command to find out on which
machines individual tasks are scheduled.
You can open the web UI via the
`open` command, if invoked from your machine:

    aurora job open [<cluster>[/<role>[/<env>/<job_name>]]]

If only the cluster is specified, it goes directly to that cluster's
scheduler main page. If the role is specified, it goes to the top-level
role page. If the full job key is specified, it goes directly to the job
page where you can inspect individual tasks.

### SSHing to a Specific Task Machine

    aurora task ssh <job_key> <shard number>

You can have the Aurora client ssh directly to the machine that has been
assigned a particular Job/shard number. This may be useful for quickly
diagnosing issues such as performance issues or abnormal behavior on a
particular machine.

### Templating Command Arguments

    aurora task run [-e] [-t THREADS] <job_key> -- <<command-line>>

Given a job specification, run the supplied command on all hosts and
return the output. You may use the standard Mustache templating rules:

- `{{thermos.ports[name]}}` substitutes the specific named port of the
  task assigned to this machine
- `{{mesos.instance}}` substitutes the shard id of the job's task
  assigned to this machine
- `{{thermos.task_id}}` substitutes the task id of the job's task
  assigned to this machine

For example, the following type of pattern can be a powerful diagnostic
tool:

    aurora task run -t5 cluster1/tyg/devel/seizure -- \
      'curl -s -m1 localhost:{{thermos.ports[http]}}/vars | grep uptime'

By default, the command runs in the Task's sandbox. The `-e` option can
run the command in the executor's sandbox. This is mostly useful for
Aurora administrators.

You can parallelize the runs by using the `-t` option.

Added: aurora/site/source/documentation/0.12.0/committers.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/committers.md?rev=1733548&view=auto
==============================================================================
--- aurora/site/source/documentation/0.12.0/committers.md (added)
+++ aurora/site/source/documentation/0.12.0/committers.md Fri Mar 4 02:43:01 2016
@@ -0,0 +1,81 @@
Setting up your email account
-----------------------------
Once your Apache ID has been set up you can configure your account, add ssh keys, and set up an
email forwarding address at

    http://id.apache.org

Additional instructions for setting up your new committer email can be found at

    http://www.apache.org/dev/user-email.html

The recommended setup is to configure all services (mailing lists, JIRA, ReviewBoard) to send
emails to your @apache.org email address.


Creating a gpg key for releases
-------------------------------
In order to create a release candidate you will need a gpg key published to an external key server
and that key will need to be added to our KEYS file as well.

1. Create a key:

        gpg --gen-key

2. Add your gpg key to the Apache Aurora KEYS file:

        git clone https://git-wip-us.apache.org/repos/asf/aurora.git
        (gpg --list-sigs <KEY ID> && gpg --armor --export <KEY ID>) >> KEYS
        git add KEYS && git commit -m "Adding gpg key for <APACHE ID>"
        ./rbt post -o -g

3. Publish the key to an external key server:

        gpg --keyserver pgp.mit.edu --send-keys <KEY ID>

4. Upload the updated KEYS file to the Apache Aurora svn dist locations listed below:

        https://dist.apache.org/repos/dist/dev/aurora/KEYS
        https://dist.apache.org/repos/dist/release/aurora/KEYS
5. Add your key to git config for use with the release scripts:

        git config --global user.signingkey <KEY ID>


Creating a release
------------------
The following will guide you through the steps to create a release candidate, vote, and finally an
official Apache Aurora release. Before starting, your gpg key should be in the KEYS file and you
must have access to commit to the dist.a.o repositories.

1. Ensure that all issues resolved for this release candidate are tagged with the correct Fix
   Version in Jira; the changelog script will use this to generate the CHANGELOG in step #2.

2. Create a release candidate. This will automatically update the CHANGELOG and commit it, create a
   branch and update the current version within the trunk. To create a minor version update and
   publish it, run

        ./build-support/release/release-candidate -l m -p

3. Update, if necessary, the draft email created from the `release-candidate` script in step #2 and
   send the [VOTE] email to the dev@ mailing list. You can verify the release signature and checksums
   by running

        ./build-support/release/verify-release-candidate

4. Wait for the vote to complete. If the vote fails, close the vote by replying to the initial [VOTE]
   email sent in step #3 by editing the subject to [RESULT][VOTE] ... and noting the failure reason
   (example [here](http://markmail.org/message/d4d6xtvj7vgwi76f)). Now address any issues, go back to
   step #1, and run again; this time use the -r flag to increment the release candidate
   version. This will automatically clean up the release candidate rc0 branch and source distribution.

        ./build-support/release/release-candidate -l m -r 1 -p

5. Once the vote has successfully passed, create the release:

        ./build-support/release/release

6. Update the draft email created from the `release` script in step #5 to include the Apache IDs of
   all binding votes and send the [RESULT][VOTE] email to the dev@ mailing list.
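
As a usage note for the KEYS file: anyone can verify a downloaded release artifact against it with
standard gpg commands. The artifact names below are illustrative:

    gpg --import KEYS
    gpg --verify apache-aurora-0.12.0.tar.gz.asc apache-aurora-0.12.0.tar.gz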

Added: aurora/site/source/documentation/0.12.0/configuration-reference.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/configuration-reference.md?rev=1733548&view=auto
==============================================================================
--- aurora/site/source/documentation/0.12.0/configuration-reference.md (added)
+++ aurora/site/source/documentation/0.12.0/configuration-reference.md Fri Mar 4 02:43:01 2016
@@ -0,0 +1,695 @@
Aurora + Thermos Configuration Reference
========================================

- [Aurora + Thermos Configuration Reference](#aurora--thermos-configuration-reference)
- [Introduction](#introduction)
- [Process Schema](#process-schema)
  - [Process Objects](#process-objects)
    - [name](#name)
    - [cmdline](#cmdline)
    - [max_failures](#max_failures)
    - [daemon](#daemon)
    - [ephemeral](#ephemeral)
    - [min_duration](#min_duration)
    - [final](#final)
    - [logger](#logger)
- [Task Schema](#task-schema)
  - [Task Object](#task-object)
    - [name](#name-1)
    - [processes](#processes)
      - [constraints](#constraints)
    - [resources](#resources)
    - [max_failures](#max_failures-1)
    - [max_concurrency](#max_concurrency)
    - [finalization_wait](#finalization_wait)
  - [Constraint Object](#constraint-object)
  - [Resource Object](#resource-object)
- [Job Schema](#job-schema)
  - [Job Objects](#job-objects)
  - [Services](#services)
  - [Revocable Jobs](#revocable-jobs)
  - [UpdateConfig Objects](#updateconfig-objects)
  - [HealthCheckConfig Objects](#healthcheckconfig-objects)
  - [Announcer Objects](#announcer-objects)
  - [Container Objects](#container-objects)
  - [LifecycleConfig Objects](#lifecycleconfig-objects)
- [Specifying Scheduling Constraints](#specifying-scheduling-constraints)
- [Template Namespaces](#template-namespaces)
  - [mesos Namespace](#mesos-namespace)
  - [thermos Namespace](#thermos-namespace)
- [Basic Examples](#basic-examples)
  - [hello_world.aurora](#hello_worldaurora)
  - [Environment Tailoring](#environment-tailoring)
    - [hello_world_productionized.aurora](#hello_world_productionizedaurora)

Introduction
============

Don't know where to start? The Aurora configuration schema is very
powerful, and configurations can become quite complex for advanced use
cases.

For examples of simple configurations to get something up and running
quickly, check out the [Tutorial](/documentation/0.12.0/tutorial/). When you feel comfortable with the basics, move
on to the [Configuration Tutorial](/documentation/0.12.0/configuration-tutorial/) for more in-depth coverage of
configuration design.

For additional basic configuration examples, see [the end of this document](#basic-examples).

Process Schema
==============

Process objects consist of required `name` and `cmdline` attributes. You can customize Process
behavior with its optional attributes. Remember, Processes are handled by Thermos.

### Process Objects

 **Attribute Name** | **Type** | **Description**
 ------------------- | :---------: | ---------------------------------
 **name**           | String      | Process name (Required)
 **cmdline**        | String      | Command line (Required)
 **max_failures**   | Integer     | Maximum process failures (Default: 1)
 **daemon**         | Boolean     | When True, this is a daemon process. (Default: False)
 **ephemeral**      | Boolean     | When True, this is an ephemeral process. (Default: False)
 **min_duration**   | Integer     | Minimum duration between process restarts in seconds. (Default: 15)
 **final**          | Boolean     | When True, this process is a finalizing one that should run last. (Default: False)
 **logger**         | Logger      | Struct defining the log behavior for the process. (Default: Empty)

#### name

The name is any valid UNIX filename string (specifically no
slashes, NULLs or leading periods). Within a Task object, each Process name
must be unique.

#### cmdline

The command line run by the process. The command line is invoked in a bash
subshell, so it can involve full-blown bash scripts. However, nothing is
supplied for command-line arguments so `$*` is unspecified.

#### max_failures

The maximum number of failures (non-zero exit statuses) this process can
have before being marked permanently failed and not retried. If a
process permanently fails, Thermos looks at the failure limit of the task
containing the process (usually 1) to determine if the task has
failed as well.

Setting `max_failures` to 0 makes the process retry
indefinitely until it achieves a successful (zero) exit status.
It retries at most once every `min_duration` seconds to prevent
an effective denial of service attack on the coordinating Thermos scheduler.

#### daemon

By default, Thermos processes are non-daemon. If `daemon` is set to True, a
successful (zero) exit status does not prevent future process runs.
Instead, the process reinvokes after `min_duration` seconds.
However, the maximum failure limit still applies. A combination of
`daemon=True` and `max_failures=0` causes a process to retry
indefinitely regardless of exit status. This should be avoided
for very short-lived processes because of the accumulation of
checkpointed state for each process run. When running in Mesos
specifically, `max_failures` is capped at 100.

#### ephemeral

By default, Thermos processes are non-ephemeral. If `ephemeral` is set to
True, the process' status is not used to determine if its containing task
has completed. For example, consider a task with a non-ephemeral
webserver process and an ephemeral logsaver process
that periodically checkpoints its log files to a centralized data store.
The task is considered finished once the webserver process has
completed, regardless of the logsaver's current status.

#### min_duration

Processes may succeed or fail multiple times during a single task's
duration. Each of these is called a *process run*. `min_duration` is
the minimum number of seconds the scheduler waits before running the
same process.

#### final

Processes can be grouped into two classes: ordinary processes and
finalizing processes. By default, Thermos processes are ordinary. They
run as long as the task is considered healthy (i.e., no failure
limits have been reached.) But once all regular Thermos processes
finish or the task reaches a certain failure threshold, it
moves into a "finalization" stage and runs all finalizing
processes. These are typically processes necessary for cleaning up the
task, such as log checkpointers, or perhaps e-mail notifications that
the task completed.

Finalizing processes may not depend upon ordinary processes or
vice-versa; however, finalizing processes may depend upon other
finalizing processes and otherwise run as a typical process
schedule.

#### logger

The default behavior of Thermos is to store stderr/stdout logs in files which grow unbounded.
In the event that you have large log volume, you may want to configure Thermos to automatically rotate logs
after they grow to a certain size, which can prevent your job from using more than its allocated
disk space.

A Logger union consists of a destination enum, a mode enum and a rotation policy.
Use `destination` to set where the process logs should be sent. The default
option is `file`. It is also possible to specify `console` to get log output
on stdout/stderr, `none` to suppress any log output, or `both` to send logs to files and
console output. When using `none` or `console`, rotation attributes are ignored.
Rotation policies only apply to loggers whose mode is `rotate`. The acceptable values
for the LoggerMode enum are `standard` and `rotate`. The rotation policy applies to both
stderr and stdout.

By default, all processes use the `standard` LoggerMode.

 **Attribute Name** | **Type**          | **Description**
 ------------------- | :---------------: | ---------------------------------
 **destination**    | LoggerDestination | Destination of logs. (Default: `file`)
 **mode**           | LoggerMode        | Mode of the logger. (Default: `standard`)
 **rotate**         | RotatePolicy      | An optional rotation policy.

A RotatePolicy describes log rotation behavior for when `mode` is set to `rotate`. It is ignored
otherwise.

 **Attribute Name** | **Type**     | **Description**
 ------------------- | :----------: | ---------------------------------
 **log_size**       | Integer      | Maximum size (in bytes) of an individual log file. (Default: 100 MiB)
 **backups**        | Integer      | The maximum number of backups to retain. (Default: 5)

An example process configuration is as follows:

    process = Process(
      name='process',
      logger=Logger(
        destination=LoggerDestination('both'),
        mode=LoggerMode('rotate'),
        rotate=RotatePolicy(log_size=5*MB, backups=5)
      )
    )

Task Schema
===========

Tasks fundamentally consist of a `name` and a list of Process objects stored as the
value of the `processes` attribute. Processes can be further constrained with
`constraints`. By default, `name`'s value inherits from the first Process in the
`processes` list, so for simple `Task` objects with one Process, `name`
can be omitted. In Mesos, `resources` is also required.

### Task Object

 **param**               | **type**                         | **description**
 ---------               | :---------:                      | ---------------
 ```name```              | String                           | Process name (Required) (Default: ```processes[0].name```)
 ```processes```         | List of ```Process``` objects    | List of ```Process``` objects bound to this task. (Required)
 ```constraints```       | List of ```Constraint``` objects | List of ```Constraint``` objects constraining processes.
 ```resources```         | ```Resource``` object            | Resource footprint. (Required)
 ```max_failures```      | Integer                          | Maximum process failures before being considered failed (Default: 1)
 ```max_concurrency```   | Integer                          | Maximum number of concurrent processes (Default: 0, unlimited concurrency.)
 ```finalization_wait``` | Integer                          | Amount of time allocated for finalizing processes, in seconds. (Default: 30)

#### name
`name` is a string denoting the name of this task. It defaults to the name of the first Process in
the list of Processes associated with the `processes` attribute.

#### processes

`processes` is an unordered list of `Process` objects. To constrain the order
in which they run, use `constraints`.

##### constraints

A list of `Constraint` objects. Currently it supports only one type,
the `order` constraint.
`order` is a list of process names
that should run in the order given. For example,

    process = Process(cmdline = "echo hello {{name}}")
    task = Task(name = "echoes",
                processes = [process(name = "jim"), process(name = "bob")],
                constraints = [Constraint(order = ["jim", "bob"])])

Constraints can be supplied ad-hoc and in duplicate. Not all
Processes need be constrained; however, Tasks with cycles are
rejected by the Thermos scheduler.

Use the `order` function as shorthand to generate `Constraint` lists.
The following:

    order(process1, process2)

is shorthand for

    [Constraint(order = [process1.name(), process2.name()])]

The `order` function accepts Process name strings `('foo', 'bar')` or the processes
themselves, e.g. `foo=Process(name='foo', ...)`, `bar=Process(name='bar', ...)`,
`constraints=order(foo, bar)`.


#### resources

Takes a `Resource` object, which specifies the amounts of CPU, memory, and disk space resources
to allocate to the Task.

#### max_failures

`max_failures` is the number of failed processes needed for the `Task` to be
marked as failed.

For example, assume a Task has two Processes and a `max_failures` value of `2`:

    template = Process(max_failures=10)
    task = Task(
      name = "fail",
      processes = [
        template(name = "failing", cmdline = "exit 1"),
        template(name = "succeeding", cmdline = "exit 0")
      ],
      max_failures=2)

The `failing` Process could fail 10 times before being marked as permanently
failed, and the `succeeding` Process could succeed on the first run. However,
the task would succeed despite only allowing for two failed processes. To be more
specific, there would be 10 failed process runs yet 1 failed process. Both processes
would have to fail for the Task to fail.

#### max_concurrency

For Tasks with a number of expensive but otherwise independent
processes, you may want to limit the amount of concurrency
the Thermos scheduler provides rather than artificially constraining
it via `order` constraints. For example, a test framework may
generate a task with 100 test run processes, but wants to run it on
a machine with only 4 cores. You can limit the amount of parallelism to
4 by setting `max_concurrency=4` in your task configuration.

For example, the following task spawns 180 Processes ("mappers")
to compute individual elements of a 180 degree sine table, all dependent
upon one final Process ("reducer") to tabulate the results:

    def make_mapper(id):
      return Process(
        name = "mapper%03d" % id,
        cmdline = "echo 'scale=50;s(%d*4*a(1)/180)' | bc -l > temp.sine_table.%03d" % (id, id))

    def make_reducer():
      return Process(name = "reducer",
                     cmdline = "cat temp.* | nl > sine_table.txt && rm -f temp.*")

    processes = map(make_mapper, range(180))

    task = Task(
      name = "mapreduce",
      processes = processes + [make_reducer()],
      constraints = [Constraint(order = [mapper.name(), 'reducer']) for mapper
                     in processes],
      max_concurrency = 8)

#### finalization_wait

Tasks have three active stages: `ACTIVE`, `CLEANING`, and `FINALIZING`. The
`ACTIVE` stage is when ordinary processes run. This stage lasts as
long as Processes are running and the Task is healthy. The moment either
all Processes have finished successfully or the Task has reached a
maximum Process failure limit, it goes into `CLEANING` stage and sends
SIGTERMs to all currently running Processes and their process trees.
Once all Processes have terminated, the Task goes into `FINALIZING` stage
and invokes the schedule of all Processes with the "final" attribute set to True.

This whole process from the end of `ACTIVE` stage to the end of `FINALIZING`
must happen within `finalization_wait` seconds. If it does not
finish during that time, all remaining Processes are sent SIGKILLs
(or if they depend upon uncompleted Processes, are
never invoked.)

Client applications with higher priority may force a shorter
finalization wait (e.g. through parameters to `thermos kill`), so this
is mostly a best-effort signal.


### Constraint Object

Current constraint objects only support a single ordering constraint, `order`,
which specifies its processes run sequentially in the order given. By
default, all processes run in parallel when bound to a `Task` without
ordering constraints.

 param | type           | description
 ----- | :----:         | -----------
 order | List of String | List of processes by name (String) that should be run serially.

### Resource Object

Specifies the amount of CPU, RAM, and disk resources the task needs. See the
[Resource Isolation document](/documentation/0.12.0/resources/) for suggested values and to understand how
resources are allocated.

 param      | type    | description
 -----      | :----:  | -----------
 ```cpu```  | Float   | Fractional number of cores required by the task.
 ```ram```  | Integer | Bytes of RAM required by the task.
 ```disk``` | Integer | Bytes of disk required by the task.


Job Schema
==========

### Job Objects

 name   | type | description
 ------ | :-------: | -------
 ```task``` | Task | The Task object to bind to this job. Required.
 ```name``` | String | Job name. (Default: inherited from the task attribute's name)
 ```role``` | String | Job role account. Required.
 ```cluster``` | String | Cluster in which this job is scheduled. Required.
 ```environment``` | String | Job environment, default ```devel```. Must be one of ```prod```, ```devel```, ```test``` or ```staging<number>```.
 ```contact``` | String | Best email address to reach the owner of the job. For production jobs, this is usually a team mailing list.
 ```instances``` | Integer | Number of instances (sometimes referred to as replicas or shards) of the task to create. (Default: 1)
 ```cron_schedule``` | String | Cron schedule in cron format. May only be used with non-service jobs. See [Cron Jobs](/documentation/0.12.0/cron-jobs/) for more information. Default: None (not a cron job.)
 ```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. ```KILL_EXISTING```: kill the previous run and schedule the new run. ```CANCEL_NEW```: let the previous run continue and cancel the new run. (Default: KILL_EXISTING)
 ```update_config``` | ```UpdateConfig``` object | Parameters for controlling the rate and policy of rolling updates.
 ```constraints``` | dict | Scheduling constraints for the tasks. See the section on the [constraint specification language](#specifying-scheduling-constraints)
 ```service``` | Boolean | If True, restart tasks regardless of success or failure. (Default: False)
 ```max_task_failures``` | Integer | Maximum number of failures after which the task is considered to have failed (Default: 1) Set to -1 to allow for infinite failures
 ```priority``` | Integer | Preemption priority to give the task (Default 0). Tasks with higher priorities may preempt tasks at lower priorities.
 ```production``` | Boolean | Whether or not this is a production task that may [preempt](/documentation/0.12.0/resources/#task-preemption) other tasks (Default: False). Production job role must have the appropriate [quota](/documentation/0.12.0/resources/#resource-quota).
 ```health_check_config``` | ```HealthCheckConfig``` object | Parameters for controlling a task's health checks. HTTP health check is only used if a health port was assigned with a command line wildcard.
 ```container``` | ```Container``` object | An optional container to run all processes inside of.
 ```lifecycle``` | ```LifecycleConfig``` object | An optional task lifecycle configuration that dictates commands to be executed on startup/teardown. HTTP lifecycle is enabled by default if the "health" port is requested. See [LifecycleConfig Objects](#lifecycleconfig-objects) for more information.
 ```tier``` | String | Task tier type. When set to `revocable`, requires the task to run with Mesos revocable resources. This is work [in progress](https://issues.apache.org/jira/browse/AURORA-1343) and is currently only supported for revocable tasks. The ultimate goal is to simplify task configuration by hiding various configuration knobs behind a task tier definition. See AURORA-1343 and AURORA-1443 for more details.

### Services

Jobs with the `service` flag set to True are called Services. The `Service`
alias can be used as shorthand for `Job` with `service=True`.
Services are differentiated from non-service Jobs in that tasks
always restart on completion, whether successful or unsuccessful.
Jobs without the service bit set only restart up to
`max_task_failures` times and only if they terminated unsuccessfully
either due to human error or machine failure.

### Revocable Jobs

**WARNING**: This feature is currently in alpha status. Do not use it in production clusters!

Mesos [supports a concept of revocable tasks](http://mesos.apache.org/documentation/latest/oversubscription/)
by oversubscribing machine resources by the amount deemed safe to not affect the existing
non-revocable tasks. Aurora now supports revocable jobs via the `tier` setting, set to the
value `revocable`.

More implementation details in this [ticket](https://issues.apache.org/jira/browse/AURORA-1343).

The scheduler must be [configured](/documentation/0.12.0/deploying-aurora-scheduler/#configuring-resource-oversubscription)
to receive revocable offers from Mesos and accept revocable jobs. If not configured properly,
revocable tasks will never get assigned to hosts and will stay in PENDING.

### UpdateConfig Objects

Parameters for controlling the rate and policy of rolling updates.

| object | type | description
| ---------------------------- | :------: | ------------
| ```batch_size``` | Integer | Maximum number of shards to be updated in one iteration (Default: 1)
| ```watch_secs``` | Integer | Minimum number of seconds a shard must remain in ```RUNNING``` state before considered a success (Default: 45)
| ```max_per_shard_failures``` | Integer | Maximum number of restarts per shard during update. Increments total failure count when this limit is exceeded. (Default: 0)
| ```max_total_failures``` | Integer | Maximum number of shard failures to be tolerated in total during an update. Cannot be greater than or equal to the total number of tasks in a job. (Default: 0)
| ```rollback_on_failure``` | boolean | When False, prevents auto rollback of a failed update (Default: True)
| ```wait_for_batch_completion``` | boolean | When True, all threads from a given batch will be blocked from picking up new instances until the entire batch is updated. This essentially simulates the legacy sequential updater algorithm. (Default: False)
| ```pulse_interval_secs``` | Integer | Indicates a [coordinated update](/documentation/0.12.0/client-commands/#coordinated-job-updates). If no pulses are received within the provided interval the update will be blocked. Beta-updater only. Will fail on submission when used with client updater. (Default: None)

### HealthCheckConfig Objects

*Note: ```endpoint```, ```expected_response``` and ```expected_response_code``` are deprecated from ```HealthCheckConfig``` and must be defined in ```HttpHealthChecker```.*

Parameters for controlling a task's health checks via HTTP or a shell command.

| param | type | description
| ------- | :-------: | --------
| ```health_checker``` | HealthCheckerConfig | Configure what kind of health check to use.
| ```initial_interval_secs``` | Integer | Initial delay for performing a health check. (Default: 15)
| ```interval_secs``` | Integer | Interval on which to check the task's health. (Default: 10)
| ```max_consecutive_failures``` | Integer | Maximum number of consecutive failures that will be tolerated before considering a task unhealthy (Default: 0)
| ```timeout_secs``` | Integer | Health check timeout. (Default: 1)

### HealthCheckerConfig Objects
| param | type | description
| ------- | :-------: | --------
| ```http``` | HttpHealthChecker | Configure health check to use HTTP. (Default)
| ```shell``` | ShellHealthChecker | Configure health check via a shell command.


### HttpHealthChecker Objects
| param | type | description
| ------- | :-------: | --------
| ```endpoint``` | String | HTTP endpoint to check (Default: /health)
| ```expected_response``` | String | If not empty, fail the HTTP health check if the response differs. Case insensitive. (Default: ok)
| ```expected_response_code``` | Integer | If not zero, fail the HTTP health check if the response code differs. (Default: 0)

### ShellHealthChecker Objects
| param | type | description
| ------- | :-------: | --------
| ```shell_command``` | String | An alternative to HTTP health checking. Specifies a shell command that will be executed. Any non-zero exit status will be interpreted as a health check failure.


### Announcer Objects

If the `announce` field in the Job configuration is set, each task will be
registered in the ServerSet `/aurora/role/environment/jobname` in the
zookeeper ensemble configured by the executor (which can be optionally overridden by specifying
the zk_path parameter). If no Announcer object is specified,
no announcement will take place. For more information about ServerSets, see the [User Guide](/documentation/0.12.0/user-guide/).

| object | type | description
| ------- | :-------: | --------
| ```primary_port``` | String | Which named port to register as the primary endpoint in the ServerSet (Default: `http`)
| ```portmap``` | dict | A mapping of additional endpoints to be announced in the ServerSet (Default: `{ 'aurora': '{{primary_port}}' }`)
| ```zk_path``` | String | Zookeeper serverset path override (executor must be started with the --announcer-allow-custom-serverset-path parameter)

### Port aliasing with the Announcer `portmap`

The primary endpoint registered in the ServerSet is the one allocated to the port
specified by the `primary_port` in the `Announcer` object, by default
the `http` port. This port can be referenced from anywhere within a configuration
as `{{thermos.ports[http]}}`.

Without the port map, each named port would be allocated a unique port number.
The `portmap` allows two different named ports to be aliased together. The default
`portmap` aliases the `aurora` port (i.e. `{{thermos.ports[aurora]}}`) to
the `http` port. Even though the two ports can be referenced independently,
only one port is allocated by Mesos. Any port referenced in a `Process` object
but which is not in the portmap will be allocated dynamically by Mesos and announced as well.

It is possible to use the portmap to alias names to static port numbers, e.g.
`{'http': 80, 'https': 443, 'aurora': 'http'}`. In this case, referencing
`{{thermos.ports[aurora]}}` would look up `{{thermos.ports[http]}}` then
find a static port 80. No port would be requested of or allocated by Mesos.

Static ports should be used cautiously as Aurora does nothing to prevent two
tasks with the same static port allocations from being co-scheduled.
External constraints such as slave attributes should be used to enforce such
guarantees should they be needed.

### Container Objects

*Note: The only container type currently supported is "docker". Docker support is currently EXPERIMENTAL.*
*Note: In order to correctly execute processes inside a job, the Docker container must have python 2.7 installed.*

Describes the container the job's processes will run inside.

 param | type | description
 ----- | :----: | -----------
 ```docker``` | Docker | A docker container to use.

### Docker Object

 param | type | description
 ----- | :----: | -----------
 ```image``` | String | The name of the docker image to execute. If the image does not exist locally it will be pulled with ```docker pull```.
 ```parameters``` | List(Parameter) | Additional parameters to pass to the docker containerizer.

### Docker Parameter Object

Docker CLI parameters. This needs to be enabled by the scheduler `enable_docker_parameters` option.
See [Docker Command Line Reference](https://docs.docker.com/reference/commandline/run/) for valid parameters.

 param | type | description
 ----- | :----: | -----------
 ```name``` | String | The name of the docker parameter. E.g. volume
 ```value``` | String | The value of the parameter. E.g. /usr/local/bin:/usr/bin:rw

### LifecycleConfig Objects

*Note: The only lifecycle configuration supported is the HTTP lifecycle via the HTTPLifecycleConfig.*

 param | type | description
 ----- | :----: | -----------
 ```http``` | HTTPLifecycleConfig | Configure the lifecycle manager to send lifecycle commands to the task via HTTP.
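
Tying the Container and Docker objects above together, here is a sketch of a job fragment that runs
its task inside a Docker image. The image name and volume parameter are illustrative, and the
scheduler's `enable_docker_parameters` option must be on for the `parameters` list to be accepted:

    hello_docker_job = Job(
      cluster = 'devcluster',
      role = 'www-data',
      environment = 'devel',
      name = 'hello_docker',
      task = hello_world_task,  # e.g. the task from the Basic Examples below
      container = Container(
        docker = Docker(
          image = 'python:2.7',  # image must provide Python 2.7 for Thermos
          parameters = [Parameter(name = 'volume', value = '/etc/aurora:/etc/aurora:ro')]
        )
      )
    )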

### HTTPLifecycleConfig Objects

 param | type | description
 ----- | :----: | -----------
 ```port``` | String | The named port to send POST commands (Default: health)
 ```graceful_shutdown_endpoint``` | String | Endpoint to hit to indicate that a task should gracefully shutdown. (Default: /quitquitquit)
 ```shutdown_endpoint``` | String | Endpoint to hit to give a task its final warning before being killed. (Default: /abortabortabort)

#### graceful_shutdown_endpoint

If the Job is listening on the port as specified by the HTTPLifecycleConfig
(default: `health`), an HTTP POST request will be sent over localhost to this
endpoint to request that the task gracefully shut itself down. This is a
courtesy call before the `shutdown_endpoint` is invoked a fixed amount of
time later.

#### shutdown_endpoint

If the Job is listening on the port as specified by the HTTPLifecycleConfig
(default: `health`), an HTTP POST request will be sent over localhost to this
endpoint as a final warning before the task is shut down. If the task
does not shut down on its own after this, it will be forcefully killed.


Specifying Scheduling Constraints
=================================

In the `Job` object there is a map `constraints` from String to String
allowing the user to tailor the schedulability of tasks within the job.

Each slave in the cluster is assigned a set of string-valued
key/value pairs called attributes. For example, consider the host
`cluster1-aaa-03-sr2` and its following attributes (given in key:value
format): `host:cluster1-aaa-03-sr2` and `rack:aaa`.

The constraint map's key value is the attribute name in which we
constrain Tasks within our Job. The value is how we constrain them.
There are two types of constraints: *limit constraints* and *value
constraints*.

| constraint | description
| ------------- | --------------
| Limit | A string that specifies a limit for a constraint. Starts with <code>'limit:</code> followed by an Integer and closing single quote, such as ```'limit:1'```.
| Value | A string that specifies a value for a constraint. To include a list of values, separate the values using commas. To negate the values of a constraint, start with a ```!```.

You can also control machine diversity using constraints. The below
constraint ensures that no more than two instances of your job may run
on a single host. Think of this as a "group by" limit.

    constraints = {
      'host': 'limit:2',
    }

Likewise, you can use constraints to control rack diversity, e.g. at
most one task per rack:

    constraints = {
      'rack': 'limit:1',
    }

Use these constraints sparingly as they can dramatically reduce Tasks' schedulability.

Template Namespaces
===================

Currently, a few Pystachio namespaces have special semantics. Using them
in your configuration allow you to tailor application behavior
through environment introspection or interact in special ways with the
Aurora client or Aurora-provided services.

### mesos Namespace

The `mesos` namespace contains variables which relate to the `mesos` slave
which launched the task. The `instance` variable can be used
to distinguish between Task replicas.

| variable name | type | description
| --------------- | :--------: | -------------
| ```instance``` | Integer | The instance number of the created task. A job with 5 replicas has instance numbers 0, 1, 2, 3, and 4.
| ```hostname``` | String | The instance hostname that the task was launched on.
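
For instance, a minimal sketch of a Process that uses both `mesos` namespace variables in its
command line (the process name is illustrative):

    announce_process = Process(
      name = 'announce_instance',
      cmdline = 'echo "instance {{mesos.instance}} running on {{mesos.hostname}}"'
    )

Each task replica renders its own values, so instance 3 of a 5-instance job would print
`instance 3` along with the hostname of the slave it landed on.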

### thermos Namespace

The `thermos` namespace contains variables that work directly on the
Thermos platform in addition to Aurora. This namespace is fully
compatible with Tasks invoked via the `thermos` CLI.

| variable | type | description |
| :----------: | --------- | ------------ |
| ```ports``` | map of string to Integer | A map of names to port numbers |
| ```task_id``` | string | The task ID assigned to this task. |

The `thermos.ports` namespace is automatically populated by Aurora when
invoking tasks on Mesos. When running the `thermos` command directly,
these ports must be explicitly mapped with the `-P` option.

For example, if `{{thermos.ports[http]}}` is specified in a `Process`
configuration, it is automatically extracted and auto-populated by
Aurora, but must be specified with, for example, `thermos -P http:12345`
to map `http` to port 12345 when running via the CLI.

Basic Examples
==============

These are provided to give a basic understanding of simple Aurora jobs.

### hello_world.aurora

Put the following in a file named `hello_world.aurora`, substituting your own values
for values such as `cluster`s.

    import os
    hello_world_process = Process(name = 'hello_world', cmdline = 'echo hello world')

    hello_world_task = Task(
      resources = Resources(cpu = 0.1, ram = 16 * MB, disk = 16 * MB),
      processes = [hello_world_process])

    hello_world_job = Job(
      cluster = 'cluster1',
      role = os.getenv('USER'),
      task = hello_world_task)

    jobs = [hello_world_job]

Then issue the following commands to create and kill the job, using your own values for the job key.

    aurora job create cluster1/$USER/test/hello_world hello_world.aurora

    aurora job kill cluster1/$USER/test/hello_world

### Environment Tailoring

#### hello_world_productionized.aurora

Put the following in a file named `hello_world_productionized.aurora`, substituting your own values
for values such as `cluster`s.

    include('hello_world.aurora')

    production_resources = Resources(cpu = 1.0, ram = 512 * MB, disk = 2 * GB)
    staging_resources = Resources(cpu = 0.1, ram = 32 * MB, disk = 512 * MB)
    hello_world_template = hello_world_job(
      name = "hello_world-{{cluster}}",
      task = hello_world_task(resources = production_resources))

    jobs = [
      # production jobs
      hello_world_template(cluster = 'cluster1', instances = 25),
      hello_world_template(cluster = 'cluster2', instances = 15),

      # staging jobs
      hello_world_template(
        cluster = 'local',
        instances = 1,
        task = hello_world_task(resources = staging_resources)),
    ]

Then issue the following commands to create and kill the job, using your own values for the job key.

    aurora job create cluster1/$USER/test/hello_world-cluster1 hello_world_productionized.aurora

    aurora job kill cluster1/$USER/test/hello_world-cluster1
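
To sanity-check either file before creating anything, the `inspect` command described in the
client commands documentation parses the configuration and displays the result, for example:

    aurora job inspect cluster1/$USER/test/hello_world hello_world.aurora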
