Added: aurora/site/source/documentation/0.22.0/features/containers.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/containers.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/containers.md (added)
+++ aurora/site/source/documentation/0.22.0/features/containers.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,130 @@
+Containers
+==========
+
+Aurora supports several containerizers, notably the Mesos containerizer and 
the Docker
+containerizer. The Mesos containerizer uses native OS features directly to 
provide isolation between
+containers, while the Docker containerizer delegates container management to 
the Docker engine.
+
+The support for launching container images via both containerizers has to be
+[enabled by a cluster operator](../../operations/configuration/#containers).
+
+Mesos Containerizer
+-------------------
+
+The Mesos containerizer is the native Mesos containerizer solution. It allows 
tasks to be
+run with an array of [pluggable isolators](../resource-isolation/) and can 
launch tasks using
+[Docker](https://github.com/docker/docker/blob/master/image/spec/v1.md) images,
+[AppC](https://github.com/appc/spec/blob/master/SPEC.md) images, or directly 
on the agent host
+filesystem.
+
+The following example (available in our [Vagrant 
environment](../../getting-started/vagrant/))
+launches a hello world example within a `debian/jessie` Docker image:
+
+    $ cat /vagrant/examples/jobs/hello_docker_image.aurora
+    hello_loop = Process(
+      name = 'hello',
+      cmdline = """
+        while true; do
+          echo hello world
+          sleep 10
+        done
+      """)
+
+    task = Task(
+      processes = [hello_loop],
+      resources = Resources(cpu=1, ram=1*MB, disk=8*MB)
+    )
+
+    jobs = [
+      Service(
+        cluster = 'devcluster',
+        environment = 'devel',
+        role = 'www-data',
+        name = 'hello_docker_image',
+        task = task,
+        container = Mesos(image=DockerImage(name='debian', tag='jessie'))
+      )
+    ]
+
+Docker and AppC images are designated using an appropriate `image` property of 
the `Mesos`
+configuration object. If either `container` or `image` is left unspecified, 
the host filesystem
+will be used. Further details of how to specify images can be found in the
+[Reference Documentation](../../reference/configuration/#mesos-object).
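+
+For illustration, an AppC image can be selected with the `AppcImage` helper (a minimal sketch; the
+`name` and `image_id` values below are hypothetical placeholders):
+
+    container = Mesos(image=AppcImage(name='example/hello', image_id='sha512-...'))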
+
+By default, Aurora launches processes as the Linux user named after the role in use (e.g. `www-data`
+in the example above). This user has to exist on the host filesystem. If it does not exist within
+the container image, it will be created automatically. Otherwise, this user and its primary group
+have to exist in the image with matching uid/gid.
+
+For more information on the Mesos containerizer filesystem, namespace, and 
isolator features, visit
+[Mesos 
Containerizer](http://mesos.apache.org/documentation/latest/mesos-containerizer/)
 and
+[Mesos Container 
Images](http://mesos.apache.org/documentation/latest/container-image/).
+
+
+Docker Containerizer
+--------------------
+
+The Docker containerizer launches container images using the Docker engine. It 
may often provide
+more advanced features than the native Mesos containerizer, but has to be 
installed separately to
+Mesos on each agent host.
+
+Starting with the 0.17.0 release, `image` can be specified with a 
`{{docker.image[name][tag]}}` binder so that
+the tag can be resolved to a concrete image digest. This ensures that the job 
always uses the same image
+across restarts, even if the version identified by the tag has been updated, 
guaranteeing that only job
+updates can mutate configuration.
+
+Example (available in the [Vagrant 
environment](../../getting-started/vagrant/)):
+
+    $ cat /vagrant/examples/jobs/hello_docker_engine.aurora
+    hello_loop = Process(
+      name = 'hello',
+      cmdline = """
+        while true; do
+          echo hello world
+          sleep 10
+        done
+      """)
+
+    task = Task(
+      processes = [hello_loop],
+      resources = Resources(cpu=1, ram=1*MB, disk=8*MB)
+    )
+
+    jobs = [
+      Service(
+        cluster = 'devcluster',
+        environment = 'devel',
+        role = 'www-data',
+        name = 'hello_docker',
+        task = task,
+        container = Docker(image = 'python:2.7')
+      ), Service(
+        cluster = 'devcluster',
+        environment = 'devel',
+        role = 'www-data',
+        name = 'hello_docker_engine_binding',
+        task = task,
+        container = Docker(image = '{{docker.image[library/python][2.7]}}')
+      )
+    ]
+
+Note that this feature requires a v2 Docker registry. If using a private Docker registry, its URL
+must be specified in the `clusters.json` configuration file under the key `docker_registry`.
+If not specified, `docker_registry` defaults to `https://registry-1.docker.io` (Docker Hub).
+
+Example:
+
+    # clusters.json
+    [{
+      "name": "devcluster",
+      ...
+      "docker_registry": "https://registry.example.com"
+    }]
+
+Details of how to use Docker via the Docker engine can be found in the
+[Reference Documentation](../../reference/configuration/#docker-object). 
Please note that in order to
+correctly execute processes inside a job, the Docker container must have 
Python 2.7 and potentially
+further Mesos dependencies installed. This limitation does not hold for Docker 
containers used via
+the Mesos containerizer.
+
+For more information on launching Docker containers through the Docker 
containerizer, visit
+[Docker 
Containerizer](http://mesos.apache.org/documentation/latest/docker-containerizer/)

Added: aurora/site/source/documentation/0.22.0/features/cron-jobs.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/cron-jobs.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/cron-jobs.md (added)
+++ aurora/site/source/documentation/0.22.0/features/cron-jobs.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,124 @@
+# Cron Jobs
+
+Aurora supports execution of scheduled jobs on a Mesos cluster using 
cron-style syntax.
+
+- [Overview](#overview)
+- [Collision Policies](#collision-policies)
+- [Failure recovery](#failure-recovery)
+- [Interacting with cron jobs via the Aurora 
CLI](#interacting-with-cron-jobs-via-the-aurora-cli)
+       - [cron schedule](#cron-schedule)
+       - [cron deschedule](#cron-deschedule)
+       - [cron start](#cron-start)
+       - [job killall, job restart, job 
kill](#job-killall-job-restart-job-kill)
+- [Technical Note About Syntax](#technical-note-about-syntax)
+- [Caveats](#caveats)
+       - [Failovers](#failovers)
+       - [Collision policy is best-effort](#collision-policy-is-best-effort)
+       - [Timezone Configuration](#timezone-configuration)
+
+## Overview
+
+A job is identified as a cron job by the presence of a
+`cron_schedule` attribute containing a cron-style schedule in the
+[`Job`](../../reference/configuration/#job-objects) object. Examples of cron 
schedules
+include "every 5 minutes" (`*/5 * * * *`), "Fridays at 17:00" (`* 17 * * 
FRI`), and
+"the 1st and 15th day of the month at 03:00" (`0 3 1,15 *`).
+
+Example (available in the [Vagrant 
environment](../../getting-started/vagrant/)):
+
+    $ cat /vagrant/examples/jobs/cron_hello_world.aurora
+    # A cron job that runs every 5 minutes.
+    jobs = [
+      Job(
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'test',
+        name = 'cron_hello_world',
+        cron_schedule = '*/5 * * * *',
+        task = SimpleTask(
+          'cron_hello_world',
+          'echo "Hello world from cron, the time is now $(date --rfc-822)"'),
+      ),
+    ]
+
+## Collision Policies
+
+The `cron_collision_policy` field specifies the scheduler's behavior when a 
new cron job is
+triggered while an older run hasn't finished. The scheduler has two policies 
available:
+
+* `KILL_EXISTING`: The default policy - on a collision, the old instances are killed and instances
+with the current configuration are started.
+* `CANCEL_NEW`: On a collision, the new run is cancelled.
+
+Note that the use of `CANCEL_NEW` is likely a code smell - interrupted cron 
jobs should be able
+to recover their progress on a subsequent invocation, otherwise they risk 
having their work queue
+grow faster than they can process it.
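+
+As a sketch, the policy is selected via the `cron_collision_policy` field of the `Job` object
+(reusing the `cron_hello_world` job from the overview above):
+
+    jobs = [
+      Job(
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'test',
+        name = 'cron_hello_world',
+        cron_schedule = '*/5 * * * *',
+        cron_collision_policy = 'KILL_EXISTING',  # or 'CANCEL_NEW'
+        task = SimpleTask(
+          'cron_hello_world',
+          'echo "Hello world from cron, the time is now $(date --rfc-822)"'),
+      ),
+    ]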
+
+## Failure recovery
+
+Unlike services, which Aurora will always re-execute regardless of exit status, instances of
+cron jobs retry according to the `max_task_failures` attribute of the
+[Task](../../reference/configuration/#task-object) object. To get 
"run-until-success" semantics,
+set `max_task_failures` to `-1`.
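+
+A minimal sketch of run-until-success semantics (the process name and command below are hypothetical):
+
+    backup = Process(name = 'backup', cmdline = './run_backup.sh')
+
+    backup_task = Task(
+      processes = [backup],
+      resources = Resources(cpu = 1, ram = 256*MB, disk = 256*MB),
+      max_task_failures = -1  # retry the task until it eventually succeeds
+    )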
+
+## Interacting with cron jobs via the Aurora CLI
+
+Most interaction with cron jobs takes place using the `cron` subcommand. See 
`aurora cron -h`
+for up-to-date usage instructions.
+
+### cron schedule
+Schedules a new cron job on the Aurora cluster for later runs or replaces the 
existing cron template
with a new one. Only future runs will be affected; any existing active tasks are left intact.
+
+    $ aurora cron schedule devcluster/www-data/test/cron_hello_world 
/vagrant/examples/jobs/cron_hello_world.aurora
+
+### cron deschedule
+Deschedules a cron job, preventing future runs but allowing current runs to 
complete.
+
+    $ aurora cron deschedule devcluster/www-data/test/cron_hello_world
+
+### cron start
+Start a cron job immediately, outside of its normal cron schedule.
+
+    $ aurora cron start devcluster/www-data/test/cron_hello_world
+
+### job killall, job restart, job kill
+Cron jobs create instances running on the cluster that you can interact with 
like normal Aurora
+tasks with `job kill` and `job restart`.
+
+
+## Technical Note About Syntax
+
+`cron_schedule` uses a restricted subset of BSD crontab syntax. While the
+execution engine currently uses Quartz, the schedule parsing is custom and supports a subset of FreeBSD
+[crontab(5)](http://www.freebsd.org/cgi/man.cgi?crontab(5)) syntax. See
+[the 
source](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/cron/CrontabEntry.java#L106-L124)
+for details.
+
+
+## Caveats
+
+### Failovers
+No failover recovery. Aurora does not record the latest minute it fired
+triggers for across failovers. Therefore it's possible to miss triggers
+on failover. Note that this behavior may change in the future.
+
+It's necessary to sync time between schedulers with something like `ntpd`.
+Clock skew could cause double or missed triggers in the case of a failover.
+
+### Collision policy is best-effort
+Aurora aims to always have *at least one copy* of a given instance running at 
a time - it's
+an AP system, meaning it chooses Availability and Partition Tolerance at the 
expense of
+Consistency.
+
+If your collision policy is `CANCEL_NEW` and a task has terminated but
+Aurora has not noticed this yet, Aurora will go ahead and create your new
+task.
+
+If your collision policy is `KILL_EXISTING` and a task was marked `LOST`
+but not yet GCed, Aurora will go ahead and create your new task without
+attempting to kill the old one (outside the GC interval).
+
+### Timezone Configuration
+Cron timezone is configured independently of the JVM timezone with the `-cron_timezone` flag and
+defaults to UTC.

Added: aurora/site/source/documentation/0.22.0/features/custom-executors.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/custom-executors.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/custom-executors.md (added)
+++ aurora/site/source/documentation/0.22.0/features/custom-executors.md Fri 
Dec 13 05:37:33 2019
@@ -0,0 +1,166 @@
+Custom Executors
+================
+
+If the need arises to use a Mesos executor other than the Thermos executor, 
the scheduler can be
+configured to utilize a custom executor by specifying the 
`-custom_executor_config` flag.
+The flag must be set to the path of a valid executor configuration file.
+
+The configuration file must be a valid **JSON array** and contain, at minimum, one executor
+configuration including the `name`, `command` and `resources` fields.
+
+### Array Entry
+
+Property                 | Description
+-----------------------  | ---------------------------------
+executor (required)      | Description of executor.
+task_prefix (required)   | Prefix given to tasks launched with this executor's configuration.
+volume_mounts (optional) | Volumes to be mounted in the container running the executor.
+
+#### executor
+
+Property                 | Description
+-----------------------  | ---------------------------------
+name (required)          | Name of the executor.
+command (required)       | How to run the executor.
+resources (required)     | Overhead to use for each executor instance.
+
+#### command
+
+Property                 | Description
+-----------------------  | ---------------------------------
+value (required)         | The command to execute.
+arguments (optional)     | A list of arguments to pass to the command.
+uris (optional)          | List of resources to download into the task sandbox.
+shell (optional)         | Run executor via shell.
+
+A note on the command property (from 
[mesos.proto](https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto)):
+
+```
+1) If 'shell == true', the command will be launched via shell
+   (i.e., /bin/sh -c 'value'). The 'value' specified will be
+   treated as the shell command. The 'arguments' will be ignored.
+2) If 'shell == false', the command will be launched by passing
+   arguments to an executable. The 'value' specified will be
+   treated as the filename of the executable. The 'arguments'
+   will be treated as the arguments to the executable. This is
+   similar to how POSIX exec families launch processes (i.e.,
+   execlp(value, arguments(0), arguments(1), ...)).
+```
+
+##### uris (list)
+* Follows the [Mesos Fetcher 
schema](http://mesos.apache.org/documentation/latest/fetcher/)
+
+Property                 | Description
+-----------------------  | ---------------------------------
+value (required)         | Path to the resource needed in the sandbox.
+executable (optional)    | Change resource to be executable via chmod.
+extract (optional)       | Extract files from packed or compressed archives 
into the sandbox.
+cache (optional)         | Use caching mechanism provided by Mesos for 
resources.
+
+#### resources (list)
+
+Property             | Description
+-------------------  | ---------------------------------
+name (required)      | Name of the resource: cpus or mem.
+type (required)      | Type of resource. Should always be SCALAR.
+scalar (required)    | Value in float for cpus or int for mem (in MBs)
+
+#### volume_mounts (list)
+
+Property                     | Description
+---------------------------  | ---------------------------------
+host_path (required)         | Host path to mount inside the container.
+container_path (required)    | Path inside the container where `host_path` 
will be mounted.
+mode (required)              | Mode in which to mount the volume, Read-Write 
(RW) or Read-Only (RO).
+
+A sample configuration is as follows:
+
+```json
+[
+    {
+      "executor": {
+        "name": "myExecutor",
+        "command": {
+          "value": "myExecutor.a",
+          "shell": "false",
+          "arguments": [
+            "localhost:2181",
+            "-verbose",
+            "-config myConfiguration.config"
+          ],
+          "uris": [
+            {
+              "value": "/dist/myExecutor.a",
+              "executable": true,
+              "extract": false,
+              "cache": true
+            },
+            {
+              "value": "/home/user/myConfiguration.config",
+              "executable": false,
+              "extract": false,
+              "cache": false
+            }
+          ]
+        },
+        "resources": [
+          {
+            "name": "cpus",
+            "type": "SCALAR",
+            "scalar": {
+              "value": 1.00
+            }
+          },
+          {
+            "name": "mem",
+            "type": "SCALAR",
+            "scalar": {
+              "value": 512
+            }
+          }
+        ]
+      },
+      "volume_mounts": [
+        {
+          "mode": "RO",
+          "container_path": "/path/on/container",
+          "host_path": "/path/to/host/directory"
+        },
+        {
+          "mode": "RW",
+          "container_path": "/container",
+          "host_path": "/host"
+        }
+      ],
+      "task_prefix": "my-executor-"
+    }
+]
+```
+
+Note that if you do not use Thermos or a Thermos-based executor, links in the scheduler's
+Web UI for tasks will not work (at least for the time being).
+Some information about launched tasks can still be accessed via the Mesos Web 
UI or via the Aurora Client.
+
+### Using a custom executor
+
+To launch tasks using a custom executor,
+an [ExecutorConfig](../../reference/configuration/#executorconfig-objects) 
object must be added to
+the Job or Service object. The `name` parameter of ExecutorConfig must match 
the name of an executor
+defined in the JSON object provided to the scheduler at startup time.
+
+For example, if we desire to launch tasks using `myExecutor` (defined above), 
we may do so in
+the following manner:
+
+```
+jobs = [Service(
+  task = task,
+  cluster = 'devcluster',
+  role = 'www-data',
+  environment = 'prod',
+  name = 'hello',
+  executor_config = ExecutorConfig(name='myExecutor'))]
+```
+
+This will create a Service Job which will launch tasks using `myExecutor` instead of Thermos.

Added: aurora/site/source/documentation/0.22.0/features/job-updates.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/job-updates.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/job-updates.md (added)
+++ aurora/site/source/documentation/0.22.0/features/job-updates.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,123 @@
+Aurora Job Updates
+==================
+
+`Job` configurations can be updated at any point in their lifecycle.
+Usually updates are done incrementally using a process called a *rolling
+upgrade*, in which Tasks are upgraded in small groups, one group at a
+time.  Updates are done using various Aurora Client commands.
+
+
+Rolling Job Updates
+-------------------
+
+There are several sub-commands to manage job updates:
+
+    aurora update start <job key> <configuration file>
+    aurora update info <job key>
+    aurora update pause <job key>
+    aurora update resume <job key>
+    aurora update abort <job key>
+    aurora update list <cluster>
+
+When you `start` a job update, the command will return once it has sent the
+instructions to the scheduler.  At that point, you may view detailed
+progress for the update with the `info` subcommand, in addition to viewing
+graphical progress in the web browser.  You may also get a full listing of
+in-progress updates in a cluster with `list`.
+
+Once an update has been started, you can `pause` to keep the update but halt
+progress.  This can be useful for things like debugging a partially-updated
+job to determine whether you would like to proceed.  You can `resume` to
+proceed.
+
+You may `abort` a job update regardless of the state it is in. This will
+instruct the scheduler to completely abandon the job update and leave the job
+in the current (possibly partially-updated) state.
+
+For a configuration update, the Aurora Scheduler calculates required changes
+by examining the current job config state and the new desired job config.
+It then starts a *rolling batched update process* by going through every batch
+and performing these operations, in order:
+
+- If an instance is not present in the scheduler but is present in
+  the new config, then the instance is created.
+- If an instance is present in both the scheduler and the new config, then
+  the scheduler diffs both task configs. If it detects any changes, it
+  performs an instance update by killing the old config instance and adding
+  the new config instance.
+- If an instance is present in the scheduler but isn't in the new config,
+  then that instance is killed.
+
+The Aurora Scheduler continues through the instance list until all tasks are
+updated and in `RUNNING`. If the scheduler determines the update is not going
+well (based on the criteria specified in the UpdateConfig), it cancels the 
update.
+
+Update cancellation runs a procedure similar to the update sequence described
+above, but in reverse order. New instance configs are swapped
+with old instance configs and batch updates proceed backwards
+from the point where the update failed. For example, (0,1,2) (3,4,5) (6,7,
+8-FAIL) results in a rollback in order (8,7,6) (5,4,3) (2,1,0).
+
+For details on how to control a job update, please see the
+[UpdateConfig](../../reference/configuration/#updateconfig-objects) 
configuration object.
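+
+As a sketch (the values below are illustrative, not recommendations, and `task` is assumed to be
+defined elsewhere in the configuration), rolling update behavior is controlled by attaching an
+`update_config` to the `Job`:
+
+    update_config = UpdateConfig(
+      batch_size = 2,              # update two instances at a time
+      max_per_shard_failures = 1,  # tolerate one failure per instance before giving up on it
+      max_total_failures = 0       # abort the update on the first permanently failed instance
+    )
+
+    jobs = [
+      Service(
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'prod',
+        name = 'hello',
+        update_config = update_config,
+        task = task
+      )
+    ]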
+
+
+Coordinated Job Updates
+------------------------
+
+Some Aurora services may benefit from having more control over updates by 
explicitly
+acknowledging ("heartbeating") job update progress. This may be helpful for 
mission-critical
+service updates where explicit job health monitoring is vital during the 
entire job update
+lifecycle. Such job updates would rely on an external service (or a custom 
client) periodically
+pulsing an active coordinated job update via a
+[pulseJobUpdate 
RPC](https://github.com/apache/aurora/blob/rel/0.22.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+
+A coordinated update is defined by setting a positive
+[pulse_interval_secs](../../reference/configuration/#updateconfig-objects) value in the job configuration
+file. If no pulses are received within the specified interval, the update will be blocked. A blocked
+update is unable to continue rolling forward (or rolling back) but retains its 
active status.
+It may only be unblocked by a fresh `pulseJobUpdate` call.
+
+NOTE: A coordinated update starts in `ROLL_FORWARD_AWAITING_PULSE` state and 
will not make any
+progress until the first pulse arrives. However, a paused update 
(`ROLL_FORWARD_PAUSED` or
+`ROLL_BACK_PAUSED`) is still considered active and upon resuming will 
immediately make progress
+provided the pulse interval has not expired.
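+
+A minimal sketch of opting into coordinated updates (the interval value is illustrative):
+
+    update_config = UpdateConfig(
+      pulse_interval_secs = 120  # block the update if no pulseJobUpdate RPC arrives within 2 minutes
+    )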
+
+
+SLA-Aware Updates
+-----------------
+
+Updates can take advantage of [Custom SLA 
Requirements](../../features/sla-requirements/) and
+specify the `sla_aware=True` option within
+[UpdateConfig](../../reference/configuration/#updateconfig-objects) to only 
update instances if
+the action will maintain the task's SLA requirements. This feature allows 
updates to avoid killing
+too many instances in the face of unexpected failures outside of the update 
range.
+
+See the [Using the `sla_aware` 
option](../../reference/configuration/#using-the-sla-aware-option)
+for more information on how to use this feature.
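+
+A minimal sketch of enabling the option (see the linked reference for the full semantics):
+
+    update_config = UpdateConfig(
+      sla_aware = True  # only update an instance if doing so maintains the task's SLA requirements
+    )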
+
+
+Canary Deployments
+------------------
+
+Canary deployments are a pattern for rolling out updates to a subset of job 
instances,
+in order to test different code versions alongside the actual production job.
+It is a risk-mitigation strategy for job owners and commonly used in a form 
where
+job instance 0 runs with a different configuration than the instances 1-N.
+
+For example, consider a job with 4 instances that each
+request 1 core of cpu, 1 GB of RAM, and 1 GB of disk space as specified
+in the configuration file `hello_world.aurora`. If you want to
+update it so it requests 2 GB of RAM instead of 1. You can create a new
+configuration file to do that called `new_hello_world.aurora` and
+issue
+
+    aurora update start <job_key_value>/0-1 new_hello_world.aurora
+
+This results in instances 0 and 1 having 1 cpu, 2 GB of RAM, and 1 GB of disk 
space,
+while instances 2 and 3 have 1 cpu, 1 GB of RAM, and 1 GB of disk space. If 
instance 3
+dies and restarts, it restarts with 1 cpu, 1 GB RAM, and 1 GB disk space.
+
+This means there are two simultaneous task configurations for the same job
+at the same time, each valid for a different range of instances. While this isn't a recommended
+pattern, it is valid and supported by the Aurora scheduler.

Added: aurora/site/source/documentation/0.22.0/features/mesos-fetcher.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/mesos-fetcher.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/mesos-fetcher.md (added)
+++ aurora/site/source/documentation/0.22.0/features/mesos-fetcher.md Fri Dec 
13 05:37:33 2019
@@ -0,0 +1,46 @@
+Mesos Fetcher
+=============
+
+Mesos has support for downloading resources into the sandbox through the
+use of the [Mesos Fetcher](http://mesos.apache.org/documentation/latest/fetcher/).
+
+Aurora supports passing URIs to the Mesos Fetcher dynamically by including
+a list of URIs in job submissions.
+
+How to use
+----------
+The scheduler flag `-enable_mesos_fetcher` must be set to true.
+
+Currently only the scheduler side of this feature has been implemented,
+so a modification to the existing client or a custom Thrift client is required
+to make use of this feature.
+
+If using a custom Thrift client, the list of URIs must be included in 
TaskConfig
+as the `mesosFetcherUris` field.
+
+Each Mesos Fetcher URI has the following data members:
+
+|Property | Description|
+|---------|------|
+|value (required)  |Path to the resource needed in the sandbox.|
+|extract (optional)|Extract files from packed or compressed archives into the 
sandbox.|
+|cache (optional) | Use caching mechanism provided by Mesos for resources.|
+
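+For illustration, a custom Python Thrift client could populate the field roughly as follows (a
+minimal sketch; the import path of the generated bindings is an assumption and may differ in your
+build of the Aurora Thrift API):
+
+    # Hypothetical sketch using Thrift bindings generated from api.thrift.
+    from gen.apache.aurora.api.ttypes import MesosFetcherURI
+
+    task_config.mesosFetcherUris = {
+      MesosFetcherURI(value='https://example.com/artifact.tar.gz',
+                      extract=True,   # unpack the archive into the sandbox
+                      cache=False),   # do not use the Mesos fetcher cache
+    }
+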
+Note that this structure is very similar to the one provided for downloading
+resources needed for a [custom executor](../../operations/configuration/).
+
+This is because both features use the Mesos fetcher to retrieve resources into
+the sandbox. However, the custom executor feature uses a static set of URIs
+configured on the server side, while the Mesos Fetcher feature uses a dynamic set
+of URIs provided at job submission time.
+
+Security Implications
+---------------------
+There are security implications that must be taken into account when enabling 
this feature.
+**Enabling this feature may potentially enable any job submitting user to 
perform a privilege escalation.**
+
+Until a more thorough solution is created, one step that has been taken to mitigate this issue
+is to statically mark every user-submitted URI as non-executable. This is in contrast to the set of URIs
+used by the custom executor feature, which may mark any URI as executable.
+
+If the need arises to mark a downloaded URI as executable, please consider 
using the custom executor feature.
\ No newline at end of file

Added: aurora/site/source/documentation/0.22.0/features/multitenancy.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/multitenancy.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/multitenancy.md (added)
+++ aurora/site/source/documentation/0.22.0/features/multitenancy.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,82 @@
+Multitenancy
+============
+
+Aurora is a multi-tenant system that can run jobs of multiple clients/tenants.
+Going beyond the [resource isolation on an individual 
host](../resource-isolation/), it is
+crucial to prevent those jobs from stepping on each other's toes.
+
+
+Job Namespaces
+--------------
+
+The namespace for jobs in Aurora follows a hierarchical structure. This is 
meant to make it easier
+to differentiate between different jobs. A job key consists of four parts. The 
four parts are
+`<cluster>/<role>/<environment>/<jobname>` in that order:
+
+* Cluster refers to the name of a particular Aurora installation.
+* Role names are user accounts.
+* Environment names are namespaces.
+* Jobname is the custom name of your job.
+
+Role names correspond to user accounts. They are used for
+[authentication](../../operations/security/), as the Linux user used to run jobs, and for the
+assignment of [quota](#preemption). If you don't know what accounts are 
available, contact your
+sysadmin.
+
+The environment component in the job key serves as a namespace. The values for
+environment are validated in the scheduler. By default, any of `devel`, `test`,
+`production`, and any value matching the regular expression `staging[0-9]*` is allowed.
+This validation can be changed to allow any arbitrary regular expression by setting the scheduler
+option `allowed_job_environments`.
+
+None of the values imply any difference in the scheduling behavior. 
Conventionally, the
+"environment" is set so as to indicate a certain level of stability in the 
behavior of the job
+by ensuring that an appropriate level of testing has been performed on the 
application code. e.g.
+in the case of a typical Job, releases may progress through the following 
phases in order of
+increasing level of stability: `devel`, `test`, `staging`, `production`.
+
+
+Configuration Tiers
+-------------------
+
+A tier is a predefined bundle of task configuration options. Aurora schedules tasks and assigns them
+resources based on their tier assignment. The default scheduler tier 
configuration allows for
+3 tiers:
+
+ - `revocable`: The `revocable` tier requires the task to run with 
[revocable](../resource-isolation/#oversubscription)
+ resources.
+ - `preemptible`: Setting the task’s tier to `preemptible` allows for the 
possibility of that task
+ being [preempted](#preemption) by other tasks when the cluster is running low on resources.
+ - `preferred`: The `preferred` tier prevents the task from using 
[revocable](../resource-isolation/#oversubscription)
+ resources and from being [preempted](#preemption).
+
+Since it is possible that a cluster is configured with a custom tier 
configuration, users should
+consult their cluster administrator to be informed of the tiers supported by 
the cluster. Attempts
+to schedule jobs with an unsupported tier will be rejected by the scheduler.
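+
+As a sketch, assuming the default tier configuration above and a `task` defined elsewhere in the
+configuration, the tier is selected via the `tier` attribute of the `Job` object:
+
+    jobs = [
+      Service(
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'prod',
+        name = 'hello',
+        tier = 'preferred',  # or 'preemptible' / 'revocable'
+        task = task
+      )
+    ]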
+
+
+Preemption
+----------
+
+In order to guarantee that important production jobs are always running, 
Aurora supports
+preemption.
+
+Consider a pending job that is a candidate for scheduling, but a resource shortage
+prevents it from being scheduled. Active tasks can become the victim of preemption if:
+
+ - both candidate and victim are owned by the same role and the
+   [priority](../../reference/configuration/#job-objects) of a victim is lower 
than the
+   [priority](../../reference/configuration/#job-objects) of the candidate.
+ - OR a victim is a `preemptible` or `revocable` [tier](#configuration-tiers) 
task and the candidate
+   is a `preferred` [tier](#configuration-tiers) task.
+
+In other words, tasks from `preferred` 
[tier](../../reference/configuration/#job-objects) jobs may
+preempt tasks from any `preemptible` or `revocable` job. However, a 
`preferred` task may only be
+preempted by tasks from `preferred` jobs in the same role with higher 
[priority](../../reference/configuration/#job-objects).
+
+Aurora requires resource quotas for [production non-dedicated 
jobs](../../reference/configuration/#job-objects).
+Quota is enforced at the job role level and, when set, defines a non-preemptible pool of compute resources within
+that role. All job types (service, adhoc or cron) require role resource quota unless a job has a
+[dedicated constraint](../constraints/#dedicated-attribute) set.
+
+To grant quota to a particular role in production, an operator can use the 
command
+`aurora_admin set_quota`.

Added: aurora/site/source/documentation/0.22.0/features/resource-isolation.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/resource-isolation.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/resource-isolation.md 
(added)
+++ aurora/site/source/documentation/0.22.0/features/resource-isolation.md Fri 
Dec 13 05:37:33 2019
@@ -0,0 +1,181 @@
+Resource Isolation and Sizing
+==============================
+
+This document assumes Aurora and Mesos have been configured
+using our [recommended resource isolation 
settings](../../operations/configuration/#resource-isolation).
+
+- [Isolation](#isolation)
+- [Sizing](#sizing)
+- [Oversubscription](#oversubscription)
+
+
+Isolation
+---------
+
+Aurora is a multi-tenant system; a single software instance runs on a
+server, serving multiple clients/tenants. To share resources among
+tenants, it leverages Mesos for isolation of:
+
+* CPU
+* GPU
+* memory
+* disk space
+* ports
+
+CPU is a soft limit, and handled differently from memory and disk space.
+Too low a CPU value results in throttling your application and
+slowing it down. Memory and disk space are both hard limits; when your
+application goes over these values, it's killed.
+
+### CPU Isolation
+
+Mesos can be configured to use a quota-based CPU scheduler (the *Completely Fair Scheduler*)
+to provide consistent and predictable performance.
+This is effectively a guarantee of resources -- you receive at least what
+you requested, but also no more than you've requested.
+
+The scheduler gives applications a CPU quota for every 100 ms interval.
+When an application uses its quota for an interval, it is throttled for
+the rest of the 100 ms. Usage resets for each interval and unused
+quota does not carry over.
+
+For example, an application specifying 4.0 CPU has access to 400 ms of
+CPU time every 100 ms. This CPU quota can be used in different ways,
+depending on the application and available resources. Consider the
+scenarios shown in this diagram.
+
+![CPU Availability](../images/CPUavailability.png)
+
+* *Scenario A*: the application can use up to 4 cores continuously for
+every 100 ms interval. It is never throttled and starts processing
+new requests immediately.
+
+* *Scenario B* : the application uses up to 8 cores (depending on
+availability) but is throttled after 50 ms. The CPU quota resets at the
+start of each new 100 ms interval.
+
+* *Scenario C* : is like Scenario A, but there is a garbage collection
+event in the second interval that consumes all CPU quota. The
+application throttles for the remaining 75 ms of that interval and
+cannot service requests until the next interval. In this example, the
+garbage collection finished in one interval but, depending on how much
+garbage needs collecting, it may take more than one interval and further
+delay service of requests.
+
+*Technical Note*: Mesos considers logical cores, also known as
+hyperthreading or SMT cores, as the unit of CPU.
+
+### Memory Isolation
+
+Mesos uses dedicated memory allocation. Your application always has
+access to the amount of memory specified in your configuration. The
+application's memory use is defined as the sum of the resident set size
+(RSS) of all processes in a shard. Each shard is considered
+independently.
+
+In other words, say you specified a memory size of 10GB. Each shard
+would receive 10GB of memory. If an individual shard's memory demands
+exceed 10GB, that shard is killed, but the other shards continue
+working.
+
+*Technical note*: Total memory size is not enforced at allocation time,
+so your application can request more than its allocation without getting
+an ENOMEM. However, it will be killed shortly after.
+
+### Disk Space
+
+Disk space used by your application is defined as the sum of the files'
+disk space in your application's directory, including the `stdout` and
+`stderr` logged from your application. Each shard is considered
+independently. You should use off-node storage for your application's
+data whenever possible.
+
+In other words, say you specified disk space size of 100MB. Each shard
+would receive 100MB of disk space. If an individual shard's disk space
+demands exceed 100MB, that shard is killed, but the other shards
+continue working.
+
+After your application finishes running, its allocated disk space is
+reclaimed. Thus, your job's final action should move any disk content
+that you want to keep, such as logs, to your home file system or other
+less transitory storage. Disk reclamation takes place an undefined
+period after the application finish time; until then, the disk contents
+are still available but you shouldn't count on them being so.
+
+*Technical note* : Disk space is not enforced at write so your
+application can write above its quota without getting an ENOSPC, but it
+will be killed shortly after. This is subject to change.
+
+### GPU Isolation
+
+GPU isolation will be supported for Nvidia devices starting from Mesos 1.0.
+Access to the allocated units will be exclusive with no sharing between tasks
+allowed (e.g. no fractional GPU allocation). For more details, see the
+[Mesos design 
document](https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit#heading=h.w84lz7p4eexl)
+and the [Mesos agent 
configuration](http://mesos.apache.org/documentation/latest/configuration/).
+
+### Other Resources
+
+Other resources, such as network bandwidth, do not have any performance
+guarantees. For some resources, such as memory bandwidth, there are no
+practical sharing methods so some application combinations collocated on
+the same host may cause contention.
+
+
+Sizing
+-------
+
+### CPU Sizing
+
+To correctly size Aurora-run Mesos tasks, specify a per-shard CPU value
+that lets the task run at its desired performance when at peak load
+distributed across all shards. Include reserve capacity of at least 50%,
+possibly more, depending on how critical your service is (or how
+confident you are about your original estimate :-)), ideally by
+increasing the number of shards to also improve resiliency. When running
+your application, observe its CPU stats over time. If consistently at or
+near your quota during peak load, you should consider increasing either
+per-shard CPU or the number of shards.
+
+### Memory Sizing
+
+Size for your application's peak requirement. Observe the per-instance
+memory statistics over time, as memory requirements can vary over
+different periods. Remember that if your application exceeds its memory
+value, it will be killed, so you should also add a safety margin of
+around 10-20%. If you have the ability to do so, you may also want to
+put alerts on the per-instance memory.
+
+### Disk Space Sizing
+
+Size for your application's peak requirement. Rotate and discard log
+files as needed to stay within your quota. When running a Java process,
+add the maximum size of the Java heap to your disk space requirement, in
+order to account for an out of memory error dumping the heap
+into the application's sandbox space.
+
+### GPU Sizing
+
+GPU is highly dependent on your application requirements and is only limited
+by the number of physical GPU units available on a target box.
+
+
+Oversubscription
+----------------
+
+Mesos supports [oversubscription of machine 
resources](http://mesos.apache.org/documentation/latest/oversubscription/)
+via the concept of revocable tasks. In contrast to non-revocable tasks, 
revocable tasks are best-effort.
+Mesos reserves the right to throttle or even kill them if they might affect 
existing high-priority
+user-facing services.
+
+As of today, the only revocable resources supported by Aurora are CPU and RAM. A job can
+opt in to use those by specifying the `revocable` [Configuration Tier](../../features/multitenancy/#configuration-tiers).
+A revocable job will only be scheduled using revocable resources, even if 
there are plenty of
+non-revocable resources available.
+
+The Aurora scheduler must be [configured to receive revocable 
offers](../../operations/configuration/#resource-isolation)
+from Mesos and accept revocable jobs. If not configured properly, revocable tasks will never get
+assigned to hosts and will stay in `PENDING`.
+
+For details on how to mark a job as being revocable, see the
+[Configuration Reference](../../reference/configuration/).

Added: aurora/site/source/documentation/0.22.0/features/service-discovery.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/service-discovery.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/service-discovery.md 
(added)
+++ aurora/site/source/documentation/0.22.0/features/service-discovery.md Fri 
Dec 13 05:37:33 2019
@@ -0,0 +1,43 @@
+Service Discovery
+=================
+
+It is possible for the Aurora executor to announce tasks into ServerSets for
+the purpose of service discovery.  ServerSets use the Zookeeper [group 
membership 
pattern](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_outOfTheBox)
+of which there are several reference implementations:
+
+  - [C++](https://github.com/apache/mesos/blob/master/src/zookeeper/group.cpp)
+  - 
[Java](https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/zookeeper/ServerSetImpl.java#L221)
+  - 
[Python](https://github.com/twitter/commons/blob/master/src/python/twitter/common/zookeeper/serverset/serverset.py#L51)
+
+These can also be used natively in Finagle using the 
[ZookeeperServerSetCluster](https://github.com/twitter/finagle/blob/master/finagle-serversets/src/main/scala/com/twitter/finagle/zookeeper/ZookeeperServerSetCluster.scala).
+
+For more information about how to configure announcing, see the [Configuration 
Reference](../../reference/configuration/).
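+
+A minimal sketch of announcing a job into a ServerSet (assuming the task requests a port named
+`http` and is defined elsewhere in the configuration):
+
+    jobs = [
+      Service(
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'prod',
+        name = 'hello',
+        task = task,
+        announce = Announcer(primary_port = 'http')
+      )
+    ]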
+
+Using Mesos DiscoveryInfo
+-------------------------
+Experimental support for populating DiscoveryInfo in Mesos is introduced in Aurora. This can be used to build a
+custom service discovery system that does not use ZooKeeper. Please see the `Service Discovery` section in the
+[Mesos Framework Development guide](http://mesos.apache.org/documentation/latest/app-framework-development-guide/) for an
+explanation of the protobuf message in Mesos.
+
+To use this feature, please enable the `--populate_discovery_info` flag on the scheduler. All jobs started by the scheduler
+afterwards will have their portmap populated to Mesos and discoverable via the `/state` endpoint on the Mesos master and agent.
+
+### Using Mesos DNS
+An example is using [Mesos-DNS](https://github.com/mesosphere/mesos-dns), which is able to generate multiple DNS
+records. With the current implementation, the example job with key `devcluster/vagrant/test/http-example` generates at
+least the following:
+
+1. An A record for `http_example.test.vagrant.aurora.mesos` (which only includes the IP address);
+2. A [SRV record](https://en.wikipedia.org/wiki/SRV_record) for
+ `_http_example.test.vagrant._tcp.aurora.mesos`, which includes the IP address and every port. This should only
+  be used if the service has one port.
+3. A SRV record `_{port-name}._http_example.test.vagrant._tcp.aurora.mesos` 
for each port name
+  defined. This should be used when the service has multiple ports. To have this work properly,
+  `-populate_discovery_info` must be added to the scheduler's configuration.
+
+Things to note:
+
+1. The domain part (".mesos" in above example) can be configured in [Mesos 
DNS](http://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html);
+2. Right now, portmap and port aliases in announcer object are not reflected 
in DiscoveryInfo, therefore not visible in
+   Mesos DNS records either. This is because they are only resolved in thermos 
executors.

Added: aurora/site/source/documentation/0.22.0/features/services.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/services.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/services.md (added)
+++ aurora/site/source/documentation/0.22.0/features/services.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,116 @@
+Long-running Services
+=====================
+
+Jobs that are always restarted on completion, whether successful or unsuccessful,
+are called services. This is useful for long-running processes
+such as webservices that should always be running, unless stopped explicitly.
+
+
+Service Specification
+---------------------
+
+A job is identified as a service by the presence of the flag
+`service=True` in the [`Job`](../../reference/configuration/#job-objects) object.
+The `Service` alias can be used as shorthand for `Job` with `service=True`.
+
+Example (available in the [Vagrant 
environment](../../getting-started/vagrant/)):
+
+    $ cat /vagrant/examples/jobs/hello_world.aurora
+    hello = Process(
+      name = 'hello',
+      cmdline = """
+        while true; do
+          echo hello world
+          sleep 10
+        done
+      """)
+
+    task = SequentialTask(
+      processes = [hello],
+      resources = Resources(cpu = 1.0, ram = 128*MB, disk = 128*MB)
+    )
+
+    jobs = [
+      Service(
+        task = task,
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'prod',
+        name = 'hello'
+      )
+    ]
+
+
+Jobs without the service bit set only restart up to `max_task_failures` times 
and only if they
+terminated unsuccessfully either due to human error or machine failure (see the
+[`Job`](../../reference/configuration/#job-objects) object for details).
+
+
+Ports
+-----
+
+In order to be useful, most services have to bind to one or more ports. Aurora enables this
+use case via the [`thermos.ports` namespace](../../reference/configuration/#thermos-namespace) that
+allows requesting arbitrarily named ports:
+
+
+    nginx = Process(
+      name = 'nginx',
+      cmdline = './run_nginx.sh -port {{thermos.ports[http]}}'
+    )
+
+
+When this process is included in a job, the job will be allocated a port, and 
the command line
+will be replaced with something like:
+
+    ./run_nginx.sh -port 42816
+
+Where 42816 happens to be the allocated port.
+
+For details on how to enable clients to discover this dynamically assigned 
port, see our
+[Service Discovery](../service-discovery/) documentation.
+
+
+Health Checking
+---------------
+
+Typically, the Thermos executor monitors processes within a task only by 
liveness of the forked
+process. In addition to that, Aurora has support for rudimentary health checking: either via HTTP
+or via custom shell scripts.
+
+For example, simply by requesting a `health` port, a process can request to be 
health checked
+via repeated calls to the `/health` endpoint:
+
+    nginx = Process(
+      name = 'nginx',
+      cmdline = './run_nginx.sh -port {{thermos.ports[health]}}'
+    )
+
+Please see the
+[configuration 
reference](../../reference/configuration/#healthcheckconfig-objects)
+for configuration options for this feature.
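+
+A minimal sketch of customizing health checks (values are illustrative, not recommendations; `task`
+is defined as in the examples above):
+
+    health_check_config = HealthCheckConfig(
+      initial_interval_secs = 15,    # grace period before failures start counting
+      interval_secs = 10,            # seconds between health check attempts
+      timeout_secs = 1,              # seconds before an individual check counts as failed
+      max_consecutive_failures = 1,  # failures tolerated before the task is marked unhealthy
+      min_consecutive_successes = 1  # successes required before the task is considered healthy
+    )
+
+    jobs = [
+      Service(
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'prod',
+        name = 'hello',
+        task = task,
+        health_check_config = health_check_config
+      )
+    ]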
+
+Starting with the 0.17.0 release, job updates rely only on task health-checks 
by introducing
+a `min_consecutive_successes` parameter on the HealthCheckConfig object. This 
parameter represents
+the number of successful health checks needed before a task is moved into the 
`RUNNING` state. Tasks
+that do not have enough successful health checks within the first `n` attempts are moved to the
+`FAILED` state, where `n = ceil(initial_interval_secs/interval_secs) + 
max_consecutive_failures +
+min_consecutive_successes`. In order to accommodate variability during task 
warm up, `initial_interval_secs`
+will act as a grace period. Any health-check failures during the first `m` 
attempts are ignored and
+do not count towards `max_consecutive_failures`, where `m = 
ceil(initial_interval_secs/interval_secs)`.
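+
+For example, with `initial_interval_secs=15`, `interval_secs=10`, `max_consecutive_failures=1` and
+`min_consecutive_successes=1`, the grace period spans `m = ceil(15/10) = 2` attempts and a task is
+moved to `FAILED` if it lacks enough successful checks after `n = 2 + 1 + 1 = 4` attempts.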
+
+As [job updates](../job-updates/) are based only on health-checks, it is not necessary to set
+`watch_secs` to the worst-case update time; it can instead be set to 0. The scheduler considers a
+task that is in the `RUNNING` state to be healthy and proceeds to update the next batch of instances.
+For details on how to control health checks, please see the
+[HealthCheckConfig](../../reference/configuration/#healthcheckconfig-objects) 
configuration object.
+Existing jobs that do not configure a health-check can fall back to using `watch_secs` to
+monitor a task before considering it healthy.
+
+You can pause health checking by touching a file inside of your sandbox, named 
`.healthchecksnooze`.
+As long as that file is present, health checks will be disabled, enabling 
users to gather core
+dumps or other performance measurements without worrying about Aurora's health 
check killing
+their process.
+
+WARNING: Remember to remove this when you are done, otherwise your instance 
will have permanently
+disabled health checks.

Added: aurora/site/source/documentation/0.22.0/features/sla-metrics.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/sla-metrics.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/sla-metrics.md (added)
+++ aurora/site/source/documentation/0.22.0/features/sla-metrics.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,215 @@
+Aurora SLA Measurement
+======================
+
+- [Overview](#overview)
+- [Metric Details](#metric-details)
+  - [Platform Uptime](#platform-uptime)
+  - [Job Uptime](#job-uptime)
+  - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\))
+  - [Median Time To Starting (MTTS)](#median-time-to-starting-\(mtts\))
+  - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\))
+- [Limitations](#limitations)
+
+## Overview
+
+The primary goal of the feature is collection and monitoring of Aurora job SLA (Service Level
+Agreement) metrics that define a contractual relationship between the Aurora/Mesos platform
+and hosted services.
+
+The Aurora SLA feature is by default only enabled for service (non-cron)
+production jobs (`"production=True"` in your `.aurora` config). It can be 
enabled for
+non-production services by an operator via the scheduler command line flag 
`-sla_non_prod_metrics`.
+
+Counters that track SLA measurements are computed periodically within the 
scheduler.
+The individual instance metrics are refreshed every minute (configurable via
+`sla_stat_refresh_interval`). The instance counters are subsequently aggregated by
+relevant grouping types before being exported to the scheduler's `/vars` endpoint (when using `vagrant`
+that would be `http://192.168.33.7:8081/vars`).
+
+
+## Metric Details
+
+### Platform Uptime
+
+*Aggregate amount of time a job spends in a non-runnable state due to platform 
unavailability
+or scheduling delays. This metric tracks Aurora/Mesos uptime performance and 
reflects on any
+system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task 
kills/restarts
+will not degrade this metric.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_platform_uptime_percent`
+* Per cluster - `sla_cluster_platform_uptime_percent`
+
+**Units:** percent
+
+A fault in the task environment may cause Aurora/Mesos to have different views of the task state
+or lose track of the task's existence. In such cases, the service task is marked as LOST and
+rescheduled by Aurora. For example, this may happen when the task stays in 
ASSIGNED or STARTING
+for too long or the Mesos agent becomes unhealthy (or disappears completely). 
The time between
+task entering LOST and its replacement reaching RUNNING state is counted 
towards platform downtime.
+
+Another example of a platform downtime event is the administrator-requested 
task rescheduling. This
+happens during planned Mesos agent maintenance when all agent tasks are marked 
as DRAINED and
+rescheduled elsewhere.
+
+To accurately calculate Platform Uptime, we must separate platform incurred 
downtime from user
+actions that put a service instance in a non-operational state. It is simpler 
to isolate
+user-incurred downtime and treat all other downtime as platform incurred.
+
+Currently, a user can cause a healthy service (task) downtime in only two 
ways: via `killTasks`
+or `restartShards` RPCs. For both, their affected tasks leave an audit state 
transition trail
+relevant to uptime calculations. By applying a special "SLA meaning" to 
exposed task state
+transition records, we can build a deterministic downtime trace for every 
given service instance.
+
+A task going through a state transition carries one of three possible SLA 
meanings
+(see 
[SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.22.0/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java)
 for
+sla-to-task-state mapping):
+
+* Task is UP: starts a period where the task is considered to be up and 
running from the Aurora
+  platform standpoint.
+
+* Task is DOWN: starts a period where the task cannot reach the UP state for 
some
+  non-user-related reason. Counts towards instance downtime.
+
+* Task is REMOVED from SLA: starts a period where the task is not expected to 
be UP due to
+  user initiated action or failure. We ignore this period for the uptime 
calculation purposes.
+
+This metric is recalculated over the last sampling period (last minute) to 
account for
+any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately 
adjacent to the
+sampling interval as well as adjacent REMOVED events.
+
+### Job Uptime
+
+*Percentage of the job instances considered to be in RUNNING state for the 
specified duration
+relative to request time. This is a purely application-side metric that considers the aggregate
+uptime of all RUNNING instances. Any user- or platform-initiated restarts directly affect
+this metric.*
+
+**Collection scope:** We currently expose job uptime values at 5 pre-defined
+percentiles (50th,75th,90th,95th and 99th):
+
+* `sla_<job_key>_job_uptime_50_00_sec`
+* `sla_<job_key>_job_uptime_75_00_sec`
+* `sla_<job_key>_job_uptime_90_00_sec`
+* `sla_<job_key>_job_uptime_95_00_sec`
+* `sla_<job_key>_job_uptime_99_00_sec`
+
+**Units:** seconds
+
+You can also get customized real-time stats from the aurora client. See `aurora 
sla -h` for
+more details.
+
+### Median Time To Assigned (MTTA)
+
+*Median time a job spends waiting for its tasks to be assigned to a host. This 
is a combined
+metric that helps track the dependency of scheduling performance on the 
requested resources
+(user scope) as well as the internal scheduler bin-packing algorithm 
efficiency (platform scope).*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtta_ms`
+* Per cluster - `sla_cluster_mtta_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.22.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mtta_ms`
+    * `sla_cpu_medium_mtta_ms`
+    * `sla_cpu_large_mtta_ms`
+    * `sla_cpu_xlarge_mtta_ms`
+    * `sla_cpu_xxlarge_mtta_ms`
+  * By RAM:
+    * `sla_ram_small_mtta_ms`
+    * `sla_ram_medium_mtta_ms`
+    * `sla_ram_large_mtta_ms`
+    * `sla_ram_xlarge_mtta_ms`
+    * `sla_ram_xxlarge_mtta_ms`
+  * By DISK:
+    * `sla_disk_small_mtta_ms`
+    * `sla_disk_medium_mtta_ms`
+    * `sla_disk_large_mtta_ms`
+    * `sla_disk_xlarge_mtta_ms`
+    * `sla_disk_xxlarge_mtta_ms`
+
+**Units:** milliseconds
+
+MTTA only considers instances that have already reached ASSIGNED state and 
ignores those
+that are still PENDING. This ensures straggler instances (e.g. with 
unreasonable resource
+constraints) do not affect metric curves.
+
+### Median Time To Starting (MTTS)
+
+*Median time a job waits for its tasks to reach STARTING state. This is a 
comprehensive metric
+reflecting on the overall time it takes for the Aurora/Mesos to start 
initializing the sandbox
+for a task.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtts_ms`
+* Per cluster - `sla_cluster_mtts_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.22.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mtts_ms`
+    * `sla_cpu_medium_mtts_ms`
+    * `sla_cpu_large_mtts_ms`
+    * `sla_cpu_xlarge_mtts_ms`
+    * `sla_cpu_xxlarge_mtts_ms`
+  * By RAM:
+    * `sla_ram_small_mtts_ms`
+    * `sla_ram_medium_mtts_ms`
+    * `sla_ram_large_mtts_ms`
+    * `sla_ram_xlarge_mtts_ms`
+    * `sla_ram_xxlarge_mtts_ms`
+  * By DISK:
+    * `sla_disk_small_mtts_ms`
+    * `sla_disk_medium_mtts_ms`
+    * `sla_disk_large_mtts_ms`
+    * `sla_disk_xlarge_mtts_ms`
+    * `sla_disk_xxlarge_mtts_ms`
+
+**Units:** milliseconds
+
+MTTS only considers instances in STARTING state. This ensures straggler 
instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
+### Median Time To Running (MTTR)
+
+*Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric
+reflecting the overall time it takes for Aurora/Mesos to start executing user content.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mttr_ms`
+* Per cluster - `sla_cluster_mttr_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.22.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mttr_ms`
+    * `sla_cpu_medium_mttr_ms`
+    * `sla_cpu_large_mttr_ms`
+    * `sla_cpu_xlarge_mttr_ms`
+    * `sla_cpu_xxlarge_mttr_ms`
+  * By RAM:
+    * `sla_ram_small_mttr_ms`
+    * `sla_ram_medium_mttr_ms`
+    * `sla_ram_large_mttr_ms`
+    * `sla_ram_xlarge_mttr_ms`
+    * `sla_ram_xxlarge_mttr_ms`
+  * By DISK:
+    * `sla_disk_small_mttr_ms`
+    * `sla_disk_medium_mttr_ms`
+    * `sla_disk_large_mttr_ms`
+    * `sla_disk_xlarge_mttr_ms`
+    * `sla_disk_xxlarge_mttr_ms`
+
+**Units:** milliseconds
+
+MTTR only considers instances in RUNNING state. This ensures straggler 
instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
+## Limitations
+
+* The availability of Aurora SLA metrics is bound by the scheduler 
availability.
+
+* All metrics are calculated at a pre-defined interval (currently set at 1 
minute).
+  Scheduler restarts may result in missed collections.

Added: aurora/site/source/documentation/0.22.0/features/sla-requirements.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/sla-requirements.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/sla-requirements.md (added)
+++ aurora/site/source/documentation/0.22.0/features/sla-requirements.md Fri 
Dec 13 05:37:33 2019
@@ -0,0 +1,185 @@
+SLA Requirements
+================
+
+- [Overview](#overview)
+- [Default SLA](#default-sla)
+- [Custom SLA](#custom-sla)
+  - [Count-based](#count-based)
+  - [Percentage-based](#percentage-based)
+  - [Coordinator-based](#coordinator-based)
+
+## Overview
+
+Aurora guarantees SLA requirements for jobs. These requirements limit the impact of cluster-wide
+maintenance operations on the jobs. For instance, when an operator upgrades
+the OS on all the Mesos agent machines, the tasks scheduled on them need to be drained.
+By specifying SLA requirements, a job can make sure that it has enough instances to
+continue operating safely without incurring downtime.
+
+> SLA is defined as the minimum number of active tasks required for a job every duration window.
+A task is active if it was in `RUNNING` state during the last duration window.
+
+There is a [default](#default-sla) SLA guarantee for
+[preferred](../../features/multitenancy/#configuration-tiers) tier jobs and it 
is also possible to
+specify [custom](#custom-sla) SLA requirements.
+
+## Default SLA
+
+Aurora guarantees a default SLA requirement for tasks in
+[preferred](../../features/multitenancy/#configuration-tiers) tier.
+
+> 95% of tasks in a job will be `active` for every 30 mins.
+
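+Expressed in the policy notation used below, this default is roughly equivalent to the following;
+this snippet is illustrative only, since `preferred`-tier jobs receive the guarantee implicitly
+and do not need to configure it:
+
+```python
+# Illustrative only: the implicit default for preferred-tier jobs, written as if
+# it were an explicit policy.
+default_sla = PercentageSlaPolicy(
+  percentage = 95,
+  duration_secs = 1800   # 30 minutes
+)
+```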
+
+## Custom SLA
+
+For jobs that require different SLA requirements, Aurora allows jobs to 
specify their own
+SLA requirements via the `SlaPolicies`. There are 3 different ways to express 
SLA requirements.
+
+### [Count-based](../../reference/configuration/#countslapolicy-objects)
+
+For jobs that need a minimum `number` of instances to be running all the time,
+[`CountSlaPolicy`](../../reference/configuration/#countslapolicy-objects)
+provides the ability to express the minimum number of required active 
instances (i.e. number of
+tasks that are `RUNNING` for at least `duration_secs`). For instance, if we 
have a
+`replicated-service` that has 3 instances and needs at least 2 instances every 30 minutes to be
+treated as healthy, the SLA requirement can be expressed with a
+[`CountSlaPolicy`](../../reference/configuration/#countslapolicy-objects) as shown below:
+
+```python
+Job(
+  name = 'replicated-service',
+  role = 'www-data',
+  instances = 3,
+  sla_policy = CountSlaPolicy(
+    count = 2,
+    duration_secs = 1800
+  ),
+  ...
+)
+```
+
+### [Percentage-based](../../reference/configuration/#percentageslapolicy-objects)
+
+For jobs that need a minimum `percentage` of instances to be running all the 
time,
+[`PercentageSlaPolicy`](../../reference/configuration/#percentageslapolicy-objects)
 provides the
+ability to express the minimum percentage of required active instances (i.e. 
percentage of tasks
+that are `RUNNING` for at least `duration_secs`). For instance, if we have a `frontend-service`
+that has 10000 instances for handling peak load and cannot have more than 0.1% of the instances
+down for every 1 hour, the SLA requirement can be expressed with a
+[`PercentageSlaPolicy`](../../reference/configuration/#percentageslapolicy-objects) as shown below:
+
+```python
+Job(
+  name = 'frontend-service',
+  role = 'www-data',
+  instances = 10000,
+  sla_policy = PercentageSlaPolicy(
+    percentage = 99.9,
+    duration_secs = 3600
+  ),
+  ...
+)
+```
+
+### [Coordinator-based](../../reference/configuration/#coordinatorslapolicy-objects)
+
+When none of the above methods is enough to describe the SLA requirements for a job, the SLA
+calculation can be off-loaded to a custom service called the `Coordinator`. The `Coordinator` needs
+to expose an endpoint that will be called to check if removal of a task will affect the SLA
+requirements for the job. This is useful to control the number of tasks that undergo maintenance
+at a time, without affecting the SLA for the application.
+
+Consider an example where we have a `storage-service` that stores 2 replicas of an object. Each
+replica is distributed across the instances, such that replicas are stored on different hosts. In
+addition, a consistent hash is used for distributing the data across the instances.
+
+When an instance needs to be drained (say for host-maintenance), we have to 
make sure that at least 1 of
+the 2 replicas remains available. In such a case, a `Coordinator` service can 
be used to maintain
+the SLA guarantees required for the job.
+
+The job can be configured with a
+[`CoordinatorSlaPolicy`](../../reference/configuration/#coordinatorslapolicy-objects)
 to specify the
+coordinator endpoint and the field in the response JSON that indicates whether the SLA will be
+affected when the task is removed.
+
+```python
+Job(
+  name = 'storage-service',
+  role = 'www-data',
+  sla_policy = CoordinatorSlaPolicy(
+    coordinator_url = 'http://coordinator.example.com',
+    status_key = 'drain'
+  ),
+  ...
+)
+```
+
+
+#### Coordinator Interface [Experimental]
+
+When a [`CoordinatorSlaPolicy`](../../reference/configuration/#coordinatorslapolicy-objects) is
+specified for a job, any action that requires removing a task
+(such as drains) will be required to get approval from the `Coordinator` before proceeding. The
+coordinator service needs to expose an HTTP endpoint that accepts a `task-key` param
+(`<cluster>/<role>/<env>/<name>/<instance>`) and a JSON body describing the task
+details, the force maintenance countdown (in milliseconds) and other params, and that returns a
+JSON response containing a boolean status for allowing or disallowing the task's removal.
+
+##### Request:
+```javascript
+POST /
+  ?task=<cluster>/<role>/<env>/<name>/<instance>
+
+{
+  "forceMaintenanceCountdownMs": "604755646",
+  "task": "cluster/role/devel/job/1",
+  "taskConfig": {
+    "assignedTask": {
+      "taskId": "taskA",
+      "slaveHost": "a",
+      "task": {
+        "job": {
+          "role": "role",
+          "environment": "devel",
+          "name": "job"
+        },
+        ...
+      },
+      "assignedPorts": {
+        "http": 1000
+      },
+      "instanceId": 1
+      ...
+    },
+    ...
+  }
+}
+```
+
+##### Response:
+```json
+{
+  "drain": true
+}
+```
+
+If the Coordinator allows removal of the task, then the task’s
+[termination lifecycle](../../reference/configuration/#httplifecycleconfig-objects)
+is triggered. If the Coordinator does not allow removal, then the request will be retried again
+later.
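+
+The sketch below is a minimal, illustrative Coordinator built with Python's standard library; it
+is not part of Aurora. It only shows the request/response shape described above: it approves a
+drain when no other task of the same job is already draining, and the `drain` key matches the
+`status_key` from the example policy. The in-memory bookkeeping and port are assumptions. Because
+duplicate drain requests for the same task may arrive, a repeated request for an already-approved
+task is treated as approved.
+
+```python
+# Minimal illustrative Coordinator: approves a drain only if no other task of the
+# same job is currently considered draining. The bookkeeping is naive and in-memory;
+# a real coordinator would inspect the application's own replication state.
+import json
+from http.server import BaseHTTPRequestHandler, HTTPServer
+from urllib.parse import urlparse, parse_qs
+
+draining = set()  # task keys we have already approved for draining
+
+class Coordinator(BaseHTTPRequestHandler):
+    def do_POST(self):
+        query = parse_qs(urlparse(self.path).query)
+        task_key = query.get('task', [''])[0]   # <cluster>/<role>/<env>/<name>/<instance>
+        body = self.rfile.read(int(self.headers.get('Content-Length', 0)))
+        details = json.loads(body or '{}')       # taskConfig, forceMaintenanceCountdownMs, ...
+        # A real coordinator would consult `details` and its own state here.
+
+        # Allow at most one task of this job to drain at a time; duplicates stay approved.
+        job_prefix = task_key.rsplit('/', 1)[0]
+        allow = not any(t.startswith(job_prefix) for t in draining) or task_key in draining
+        if allow:
+            draining.add(task_key)
+
+        response = json.dumps({'drain': allow}).encode()  # key must match status_key
+        self.send_response(200)
+        self.send_header('Content-Type', 'application/json')
+        self.send_header('Content-Length', str(len(response)))
+        self.end_headers()
+        self.wfile.write(response)
+
+if __name__ == '__main__':
+    HTTPServer(('0.0.0.0', 8080), Coordinator).serve_forever()
+```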
+
+#### Coordinator Actions
+
+Each Coordinator endpoint gets its own lock, which is used to serialize calls to that Coordinator.
+It guarantees that only one concurrent request is sent to a coordinator endpoint. This allows
+coordinators to simply look at the current state of the tasks to determine their SLA (without
+having to worry about in-flight and pending requests). However, if there are multiple coordinators,
+maintenance can be done in parallel across all the coordinators.
+
+_Note: A single concurrent request to a coordinator endpoint does not translate into an
+exactly-once guarantee. The coordinator must be able to handle duplicate drain
+requests for the same task._
+
+
+

Added: aurora/site/source/documentation/0.22.0/features/webhooks.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/features/webhooks.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/features/webhooks.md (added)
+++ aurora/site/source/documentation/0.22.0/features/webhooks.md Fri Dec 13 
05:37:33 2019
@@ -0,0 +1,112 @@
+Webhooks
+========
+
+Aurora has an optional feature which allows an operator to specify a file to configure an HTTP
+webhook to receive task state change events. It can be enabled with a scheduler flag, e.g.
+`-webhook_config=/path/to/webhook.json`. At this point, webhooks are still considered *experimental*.
+
+Below is a sample configuration:
+
+```json
+{
+  "headers": {
+    "Content-Type": "application/vnd.kafka.json.v1+json",
+    "Producer-Type": "reliable"
+  },
+  "targetURL": "http://localhost:5000/";,
+  "timeoutMsec": 5
+}
+```
+
+And an example of an event payload that your endpoint will receive:
+
+```json
+{
+    "task":
+    {
+        "cachedHashCode":0,
+        "assignedTask": {
+            "cachedHashCode":0,
+            
"taskId":"vagrant-test-http_example-8-a6cf7ec5-d793-49c7-b10f-0e14ab80bfff",
+            "task": {
+                "cachedHashCode":-1819348376,
+                "job": {
+                    "cachedHashCode":803049425,
+                    "role":"vagrant",
+                    "environment":"test",
+                    "name":"http_example"
+                    },
+                "owner": {
+                    "cachedHashCode":226895216,
+                    "user":"vagrant"
+                    },
+                "isService":true,
+                "numCpus":0.1,
+                "ramMb":16,
+                "diskMb":8,
+                "priority":0,
+                "maxTaskFailures":1,
+                "production":false,
+                "resources":[
+                    
{"cachedHashCode":729800451,"setField":"NUM_CPUS","value":0.1},
+                    
{"cachedHashCode":552899914,"setField":"RAM_MB","value":16},
+                    
{"cachedHashCode":-1547868317,"setField":"DISK_MB","value":8},
+                    
{"cachedHashCode":1957328227,"setField":"NAMED_PORT","value":"http"},
+                    
{"cachedHashCode":1954229436,"setField":"NAMED_PORT","value":"tcp"}
+                    ],
+                "constraints":[],
+                "requestedPorts":["http","tcp"],
+                "taskLinks":{"http":"http://%host%:%port:http%"},
+                "contactEmail":"vagrant@localhost",
+                "executorConfig": {
+                    "cachedHashCode":-1194797325,
+                    "name":"AuroraExecutor",
+                    "data": "{\"environment\": \"test\", 
\"health_check_config\": {\"initial_interval_secs\": 5.0, \"health_checker\": { 
\"http\": {\"expected_response_code\": 0, \"endpoint\": \"/health\", 
\"expected_response\": \"ok\"}}, \"max_consecutive_failures\": 0, 
\"timeout_secs\": 1.0, \"interval_secs\": 1.0}, \"name\": \"http_example\", 
\"service\": true, \"max_task_failures\": 1, \"cron_collision_policy\": 
\"KILL_EXISTING\", \"enable_hooks\": false, \"cluster\": \"devcluster\", 
\"task\": {\"processes\": [{\"daemon\": false, \"name\": \"echo_ports\", 
\"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": 
\"echo \\\"tcp port: {{thermos.ports[tcp]}}; http port: 
{{thermos.ports[http]}}; alias: {{thermos.ports[alias]}}\\\"\", \"final\": 
false}, {\"daemon\": false, \"name\": \"stage_server\", \"ephemeral\": false, 
\"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"cp 
/vagrant/src/test/sh/org/apache/aurora/e2e/http_example.py .\", \"final\": 
false}, {\"daemon\": false, \"name\": \"run_server\", \"ephemeral\": false, 
\"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"python http_example.py 
{{thermos.ports[http]}}\", \"final\": false}], \"name\": \"http_example\", 
\"finalization_wait\": 30, \"max_failures\": 1, \"max_concurrency\": 0, 
\"resources\": {\"disk\": 8388608, \"ram\": 16777216, \"cpu\": 0.1}, 
\"constraints\": [{\"order\": [\"echo_ports\", \"stage_server\", 
\"run_server\"]}]}, \"production\": false, \"role\": \"vagrant\", \"contact\": 
\"vagrant@localhost\", \"announce\": {\"primary_port\": \"http\", \"portmap\": 
{\"alias\": \"http\"}}, \"lifecycle\": {\"http\": 
{\"graceful_shutdown_endpoint\": \"/quitquitquit\", \"port\": \"health\", 
\"shutdown_endpoint\": \"/abortabortabort\"}}, \"priority\": 0}"},
+                    "metadata":[],
+                    "container":{
+                        "cachedHashCode":-1955376216,
+                        "setField":"MESOS",
+                        "value":{"cachedHashCode":31}}
+                    },
+                    "assignedPorts":{},
+                    "instanceId":8
+        },
+        "status":"PENDING",
+        "failureCount":0,
+        "taskEvents":[
+            
{"cachedHashCode":0,"timestamp":1464992060258,"status":"PENDING","scheduler":"aurora"}]
+        },
+        "oldState":{}}
+```
+
+By default, the webhook watches all TaskStateChanges and sends events to the configured endpoint.
+If you are only interested in certain types of TaskStateChange (e.g. transitions to the `LOST`
+or `FAILED` statuses), you can specify a whitelist of the desired task statuses in webhook.json.
+The webhook will only send the events for the whitelisted statuses to the configured endpoint.
+
+```json
+{
+  "headers": {
+    "Content-Type": "application/vnd.kafka.json.v1+json",
+    "Producer-Type": "reliable"
+  },
+  "targetURL": "http://localhost:5000/";,
+  "timeoutMsec": 50,
+  "statuses": ["LOST", "FAILED"]
+}
+```
+
+If you want to whitelist all TaskStateChanges, you can add a wildcard 
character `*` to your whitelist
+like below, or simply leave out the `statuses` field in webhook.json.
+
+```json
+{
+  "headers": {
+    "Content-Type": "application/vnd.kafka.json.v1+json",
+    "Producer-Type": "reliable"
+  },
+  "targetURL": "http://localhost:5000/";,
+  "timeoutMsec": 50,
+  "statuses": ["*"]
+}
+```
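+
+To experiment with the payloads shown above, a small receiver is enough. The sketch below is
+illustrative only; it listens on the port used by the `targetURL` in the sample configs and prints
+the task id and status of every event posted to it.
+
+```python
+# Minimal illustrative webhook receiver for targetURL http://localhost:5000/.
+# Prints the task id and status of every task state change event it receives.
+import json
+from http.server import BaseHTTPRequestHandler, HTTPServer
+
+class WebhookReceiver(BaseHTTPRequestHandler):
+    def do_POST(self):
+        length = int(self.headers.get('Content-Length', 0))
+        event = json.loads(self.rfile.read(length) or '{}')
+        task = event.get('task', {})
+        task_id = task.get('assignedTask', {}).get('taskId', 'unknown')
+        print('task %s is now %s' % (task_id, task.get('status')))
+        self.send_response(200)
+        self.end_headers()
+
+if __name__ == '__main__':
+    HTTPServer(('localhost', 5000), WebhookReceiver).serve_forever()
+```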

Added: aurora/site/source/documentation/0.22.0/getting-started/overview.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/getting-started/overview.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/getting-started/overview.md (added)
+++ aurora/site/source/documentation/0.22.0/getting-started/overview.md Fri Dec 
13 05:37:33 2019
@@ -0,0 +1,112 @@
+Aurora System Overview
+======================
+
+Apache Aurora is a service scheduler that runs on top of Apache Mesos, 
enabling you to run
+long-running services, cron jobs, and ad-hoc jobs that take advantage of 
Apache Mesos' scalability,
+fault-tolerance, and resource isolation.
+
+
+Components
+----------
+
+It is important to have an understanding of the components that make up
+a functioning Aurora cluster.
+
+![Aurora Components](../images/components.png)
+
+* **Aurora scheduler**
+  The scheduler is your primary interface to the work you run in your cluster. 
 You will
+  instruct it to run jobs, and it will manage them in Mesos for you.  You will 
also frequently use
+  the scheduler's read-only web interface as a heads-up display for what's 
running in your cluster.
+
+* **Aurora client**
+  The client (`aurora` command) is a command line tool that exposes primitives that you can use to
+  interact with the scheduler.
+
+  Aurora also provides an admin client (`aurora_admin` command) that contains commands built for
+  cluster administrators.  You can use this tool to do things like manage user quotas and manage
+  graceful maintenance on machines in the cluster.
+
+* **Aurora executor**
+  The executor (a.k.a. Thermos executor) is responsible for carrying out the 
workloads described in
+  the Aurora DSL (`.aurora` files).  The executor is what actually executes 
user processes.  It will
+  also perform health checking of tasks and register tasks in ZooKeeper for 
the purposes of dynamic
+  service discovery.
+
+* **Aurora observer**
+  The observer provides browser-based access to the status of individual tasks 
executing on worker
+  machines.  It gives insight into the processes executing, and facilitates 
browsing of task sandbox
+  directories.
+
+* **ZooKeeper**
+  [ZooKeeper](http://zookeeper.apache.org) is a distributed consensus system.  
In an Aurora cluster
+  it is used for reliable election of the leading Aurora scheduler and Mesos 
master.  It is also
+  used as a vehicle for service discovery; see [Service Discovery](../../features/service-discovery/).
+
+* **Mesos master**
+  The master is responsible for tracking worker machines and performing 
accounting of their
+  resources.  The scheduler interfaces with the master to control the cluster.
+
+* **Mesos agent**
+  The agent receives work assigned by the scheduler and executes it.  It interfaces with Linux
+  isolation systems like cgroups, namespaces and Docker to manage the resource 
consumption of tasks.
+  When a user task is launched, the agent will launch the executor (in the 
context of a Linux cgroup
+  or Docker container depending upon the environment), which will in turn fork 
user processes.
+
+  In earlier versions of Mesos and Aurora, the Mesos agent was known as the 
Mesos slave.
+
+
+Jobs, Tasks and Processes
+--------------------------
+
+Aurora is a Mesos framework used to schedule *jobs* onto Mesos. Mesos
+cares about individual *tasks*, but typical jobs consist of dozens or
+hundreds of task replicas. Aurora provides a layer on top of Mesos with
+its `Job` abstraction. An Aurora `Job` consists of a task template and
+instructions for creating near-identical replicas of that task (modulo
+things like "instance id" or specific port numbers which may differ from
+machine to machine).
+
+How many tasks make up a Job is complicated. On a basic level, a Job consists 
of
+one task template and instructions for creating near-identical replicas of 
that task
+(otherwise referred to as "instances" or "shards").
+
+A task can merely be a single *process* corresponding to a single
+command line, such as `python2.7 my_script.py`. However, a task can also
+consist of many separate processes, which all run within a single
+sandbox. For example, running multiple cooperating agents together,
+such as `logrotate`, `installer`, master, or agent processes. This is
+where Thermos comes in. While Aurora provides a `Job` abstraction on
+top of Mesos `Tasks`, Thermos provides a `Process` abstraction
+underneath Mesos `Task`s and serves as part of the Aurora framework's
+executor.
+
+You define `Job`s, `Task`s, and `Process`es in a configuration file.
+Configuration files are written in Python, and make use of the
+[Pystachio](https://github.com/wickman/pystachio) templating language,
+along with specific Aurora, Mesos, and Thermos commands and methods.
+The configuration files typically end with a `.aurora` extension.
+
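+As a taste of what such a file looks like, here is a minimal sketch of the hierarchy; the names,
+cluster and resource figures are placeholders, and the tutorial later in this guide walks through
+a complete, working example.
+
+```python
+# Minimal sketch of the Job -> Task -> Process hierarchy in a .aurora file;
+# names, cluster and resource values below are placeholders.
+date_proc = Process(name = 'print_date', cmdline = 'date')
+
+date_task = Task(
+  processes = [date_proc],
+  resources = Resources(cpu = 0.1, ram = 16*MB, disk = 8*MB))
+
+jobs = [
+  Job(cluster = 'devcluster', role = 'www-data', environment = 'devel',
+      name = 'print_date', task = date_task)
+]
+```
+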
+Summary:
+
+* Aurora manages jobs made of tasks.
+* Mesos manages tasks made of processes.
+* Thermos manages processes.
+* All of this is defined in `.aurora` configuration files.
+
+![Aurora hierarchy](../images/aurora_hierarchy.png)
+
+Each `Task` has a *sandbox* created when the `Task` starts and garbage
+collected when it finishes. All of a `Task`'s processes run in its
+sandbox, so processes can share state by using a shared current working
+directory.
+
+The sandbox garbage collection policy considers many factors, most
+importantly age and size. It makes a best-effort attempt to keep
+sandboxes around as long as possible post-task in order for service
+owners to inspect data and logs, should the `Task` have completed
+abnormally. But you can't design your applications assuming sandboxes
+will be around forever; instead, build log saving or other
+checkpointing mechanisms directly into your application or into your
+`Job` description.
+

Added: aurora/site/source/documentation/0.22.0/getting-started/tutorial.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.22.0/getting-started/tutorial.md?rev=1871319&view=auto
==============================================================================
--- aurora/site/source/documentation/0.22.0/getting-started/tutorial.md (added)
+++ aurora/site/source/documentation/0.22.0/getting-started/tutorial.md Fri Dec 
13 05:37:33 2019
@@ -0,0 +1,258 @@
+# Aurora Tutorial
+
+This tutorial shows how to use the Aurora scheduler to run (and 
"`printf-debug`")
+a hello world program on Mesos. This is the recommended document for new 
Aurora users
+to start getting up to speed on the system.
+
+- [Prerequisite](#prerequisite)
+- [The Script](#the-script)
+- [Aurora Configuration](#aurora-configuration)
+- [Creating the Job](#creating-the-job)
+- [Watching the Job Run](#watching-the-job-run)
+- [Cleanup](#cleanup)
+- [Next Steps](#next-steps)
+
+
+## Prerequisite
+
+This tutorial assumes you are running [Aurora locally using 
Vagrant](../vagrant/).
+However, in general the instructions are also applicable to any other
+[Aurora installation](../../operations/installation/).
+
+Unless otherwise stated, all commands are to be run from the root of the aurora
+repository clone.
+
+
+## The Script
+
+Our "hello world" application is a simple Python script that loops
+forever, displaying the time every few seconds. Copy the code below and
+put it in a file named `hello_world.py` in the root of your Aurora repository 
clone
+(Note: this directory is the same as `/vagrant` inside the Vagrant VMs).
+
+The script has an intentional bug, which we will explain later on.
+
+<!-- NOTE: If you are changing this file, be sure to also update 
examples/vagrant/test_tutorial.sh.
+-->
+```python
+import time
+
+def main():
+  SLEEP_DELAY = 10
+  # Python experts - ignore this blatant bug.
+  for i in xrang(100):
+    print("Hello world! The time is now: %s. Sleeping for %d secs" % (
+      time.asctime(), SLEEP_DELAY))
+    time.sleep(SLEEP_DELAY)
+
+if __name__ == "__main__":
+  main()
+```
+
+## Aurora Configuration
+
+Once we have our script/program, we need to create a *configuration
+file* that tells Aurora how to manage and launch our Job. Save the below
+code in the file `hello_world.aurora`.
+
+<!-- NOTE: If you are changing this file, be sure to also update 
examples/vagrant/test_tutorial.sh.
+-->
+```python
+pkg_path = '/vagrant/hello_world.py'
+
+# we use a trick here to make the configuration change with
+# the contents of the file, for simplicity.  in a normal setting, packages 
would be
+# versioned, and the version number would be changed in the configuration.
+import hashlib
+with open(pkg_path, 'rb') as f:
+  pkg_checksum = hashlib.md5(f.read()).hexdigest()
+
+# copy hello_world.py into the local sandbox
+install = Process(
+  name = 'fetch_package',
+  cmdline = 'cp %s . && echo %s && chmod +x hello_world.py' % (pkg_path, 
pkg_checksum))
+
+# run the script
+hello_world = Process(
+  name = 'hello_world',
+  cmdline = 'python -u hello_world.py')
+
+# describe the task
+hello_world_task = SequentialTask(
+  processes = [install, hello_world],
+  resources = Resources(cpu = 1, ram = 1*MB, disk=8*MB))
+
+jobs = [
+  Service(cluster = 'devcluster',
+          environment = 'devel',
+          role = 'www-data',
+          name = 'hello_world',
+          task = hello_world_task)
+]
+```
+
+There is a lot going on in that configuration file:
+
+1. From a "big picture" viewpoint, it first defines two
+Processes. Then it defines a Task that runs the two Processes in the
+order specified in the Task definition, as well as specifying what
+computational and memory resources are available for them.  Finally,
+it defines a Job that will schedule the Task on available and suitable
+machines. This Job is the sole member of a list of Jobs; you can
+specify more than one Job in a config file.
+
+2. At the Process level, it specifies how to get your code into the
+local sandbox in which it will run. It then specifies how the code is
+actually run once the second Process starts.
+
+For more about Aurora configuration files, see the [Configuration
+Tutorial](../../reference/configuration-tutorial/) and the [Configuration
+Reference](../../reference/configuration/) (preferably after finishing this
+tutorial).
+
+
+## Creating the Job
+
+We're ready to launch our job! To do so, we use the Aurora Client to
+issue a Job creation request to the Aurora scheduler.
+
+Many Aurora Client commands take a *job key* argument, which uniquely
+identifies a Job. A job key consists of four parts, each separated by a
+"/". The four parts are  `<cluster>/<role>/<environment>/<jobname>`
+in that order:
+
+* Cluster refers to the name of a particular Aurora installation.
+* Role names are user accounts existing on the agent machines. If you
+don't know what accounts are available, contact your sysadmin.
+* Environment names are namespaces; you can count on `test`, `devel`,
+`staging` and `prod` existing.
+* Jobname is the custom name of your job.
+
+When comparing two job keys, if any of the four parts is different from
+its counterpart in the other key, then the two job keys identify two separate
+jobs. If all four values are identical, the job keys identify the same job.
+
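+As a quick illustration (plain Python, not an Aurora API), a job key can be split into its four
+parts and two keys compared field by field:
+
+```python
+# Illustration only: split a job key into its four parts and compare two keys.
+def parse_job_key(key):
+    cluster, role, environment, name = key.split('/')
+    return {'cluster': cluster, 'role': role, 'environment': environment, 'name': name}
+
+a = parse_job_key('devcluster/www-data/devel/hello_world')
+b = parse_job_key('devcluster/www-data/test/hello_world')
+print(a == b)   # False: same job name, but different environments mean different jobs
+```
+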
+The `clusters.json` [client 
configuration](../../reference/client-cluster-configuration/)
+for the Aurora scheduler defines the available cluster names.
+For Vagrant, from the top-level of your Aurora repository clone, do:
+
+    $ vagrant ssh
+
+Followed by:
+
+    vagrant@aurora:~$ cat /etc/aurora/clusters.json
+
+You'll see something like the following. The `name` value shown here corresponds to a job key's
+cluster value.
+
+```javascript
+[{
+  "name": "devcluster",
+  "zk": "192.168.33.7",
+  "scheduler_zk_path": "/aurora/scheduler",
+  "auth_mechanism": "UNAUTHENTICATED",
+  "slave_run_directory": "latest",
+  "slave_root": "/var/lib/mesos"
+}]
+```
+
+The Aurora Client command that actually runs our Job is `aurora job create`. 
It creates a Job as
+specified by its job key and configuration file arguments and runs it.
+
+    aurora job create <cluster>/<role>/<environment>/<jobname> <config_file>
+
+Or for our example:
+
+    aurora job create devcluster/www-data/devel/hello_world 
/vagrant/hello_world.aurora
+
+After entering our virtual machine using `vagrant ssh`, this returns:
+
+    vagrant@aurora:~$ aurora job create devcluster/www-data/devel/hello_world 
/vagrant/hello_world.aurora
+     INFO] Creating job hello_world
+     INFO] Checking status of devcluster/www-data/devel/hello_world
+    Job create succeeded: job 
url=http://aurora.local:8081/scheduler/www-data/devel/hello_world
+
+
+## Watching the Job Run
+
+Now that our job is running, let's see what it's doing. Access the
+scheduler web interface at `http://$scheduler_hostname:$scheduler_port/scheduler`,
+or, when using `vagrant`, at `http://192.168.33.7:8081/scheduler`.
+First we see what Jobs are scheduled:
+
+![Scheduled Jobs](../images/ScheduledJobs.png)
+
+Click on your role name, which in this case is `www-data`, and you'll see the Jobs associated
+with that role:
+
+![Role Jobs](../images/RoleJobs.png)
+
+If you click on your `hello_world` Job, you'll see:
+
+![hello_world Job](../images/HelloWorldJob.png)
+
+Oops, looks like our first job didn't quite work! The task is temporarily 
throttled for
+having failed on every attempt of the Aurora scheduler to run it. We have to 
figure out
+what is going wrong.
+
+On the Completed tasks tab, we see all past attempts of the Aurora scheduler 
to run our job.
+
+![Completed tasks tab](../images/CompletedTasks.png)
+
+We can navigate to the Task page of a failed run by clicking on the host link.
+
+![Task page](../images/TaskBreakdown.png)
+
+Once there, we see that the `hello_world` process failed. The Task page
+captures the standard error and standard output streams and makes them 
available.
+Clicking through to `stderr` on the failed `hello_world` process, we see what 
happened.
+
+![stderr page](../images/stderr.png)
+
+It looks like we made a typo in our Python script. We wanted `xrange`,
+not `xrang`. Edit the `hello_world.py` script to use the correct function
+and save it as `hello_world_v2.py`. Then update the `hello_world.aurora`
+configuration to the newest version.
+
+In order to try again, we can now instruct the scheduler to update our job:
+
+    vagrant@aurora:~$ aurora update start 
devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
+     INFO] Starting update for: hello_world
+    Job update has started. View your update progress at 
http://aurora.local:8081/scheduler/www-data/devel/hello_world/update/8ef38017-e60f-400d-a2f2-b5a8b724e95b
+
+This time, the task comes up.
+
+![Running Job](../images/RunningJob.png)
+
+By again clicking on the host, we inspect the Task page, and see that the
+`hello_world` process is running.
+
+![Running Task page](../images/runningtask.png)
+
+We then inspect the output by clicking on `stdout` and see our process'
+output:
+
+![stdout page](../images/stdout.png)
+
+## Cleanup
+
+Now that we're done, we kill the job using the Aurora client:
+
+    vagrant@aurora:~$ aurora job killall devcluster/www-data/devel/hello_world
+     INFO] Killing tasks for job: devcluster/www-data/devel/hello_world
+     INFO] Instances to be killed: [0]
+    Successfully killed instances [0]
+    Job killall succeeded
+
+The job page now shows the `hello_world` tasks as completed.
+
+![Killed Task page](../images/killedtask.png)
+
+## Next Steps
+
+Now that you've finished this Tutorial, you should read or do the following:
+
+- [The Aurora Configuration 
Tutorial](../../reference/configuration-tutorial/), which provides more examples
+  and best practices for writing Aurora configurations. You should also look at
+  the [Aurora Configuration Reference](../../reference/configuration/).
+- Explore the Aurora Client - use `aurora -h`, and read the
+  [Aurora Client Commands](../../reference/client-commands/) document.

