Added: aurora/site/source/documentation/0.21.0/features/sla-metrics.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/features/sla-metrics.md?rev=1840515&view=auto ============================================================================== --- aurora/site/source/documentation/0.21.0/features/sla-metrics.md (added) +++ aurora/site/source/documentation/0.21.0/features/sla-metrics.md Tue Sep 11 05:28:10 2018 @@ -0,0 +1,215 @@ +Aurora SLA Measurement +====================== + +- [Overview](#overview) +- [Metric Details](#metric-details) + - [Platform Uptime](#platform-uptime) + - [Job Uptime](#job-uptime) + - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\)) + - [Median Time To Starting (MTTS)](#median-time-to-starting-\(mtts\)) + - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\)) +- [Limitations](#limitations) + +## Overview + +The primary goal of the feature is collection and monitoring of Aurora job SLA (Service Level +Agreements) metrics that defining a contractual relationship between the Aurora/Mesos platform +and hosted services. + +The Aurora SLA feature is by default only enabled for service (non-cron) +production jobs (`"production=True"` in your `.aurora` config). It can be enabled for +non-production services by an operator via the scheduler command line flag `-sla_non_prod_metrics`. + +Counters that track SLA measurements are computed periodically within the scheduler. +The individual instance metrics are refreshed every minute (configurable via +`sla_stat_refresh_interval`). The instance counters are subsequently aggregated by +relevant grouping types before exporting to scheduler `/vars` endpoint (when using `vagrant` +that would be `http://192.168.33.7:8081/vars`) + + +## Metric Details + +### Platform Uptime + +*Aggregate amount of time a job spends in a non-runnable state due to platform unavailability +or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects on any +system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task kills/restarts +will not degrade this metric.* + +**Collection scope:** + +* Per job - `sla_<job_key>_platform_uptime_percent` +* Per cluster - `sla_cluster_platform_uptime_percent` + +**Units:** percent + +A fault in the task environment may cause the Aurora/Mesos to have different views on the task state +or lose track of the task existence. In such cases, the service task is marked as LOST and +rescheduled by Aurora. For example, this may happen when the task stays in ASSIGNED or STARTING +for too long or the Mesos agent becomes unhealthy (or disappears completely). The time between +task entering LOST and its replacement reaching RUNNING state is counted towards platform downtime. + +Another example of a platform downtime event is the administrator-requested task rescheduling. This +happens during planned Mesos agent maintenance when all agent tasks are marked as DRAINED and +rescheduled elsewhere. + +To accurately calculate Platform Uptime, we must separate platform incurred downtime from user +actions that put a service instance in a non-operational state. It is simpler to isolate +user-incurred downtime and treat all other downtime as platform incurred. + +Currently, a user can cause a healthy service (task) downtime in only two ways: via `killTasks` +or `restartShards` RPCs. For both, their affected tasks leave an audit state transition trail +relevant to uptime calculations. 
By applying a special "SLA meaning" to exposed task state +transition records, we can build a deterministic downtime trace for every given service instance. + +A task going through a state transition carries one of three possible SLA meanings +(see [SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.21.0/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java) for +sla-to-task-state mapping): + +* Task is UP: starts a period where the task is considered to be up and running from the Aurora + platform standpoint. + +* Task is DOWN: starts a period where the task cannot reach the UP state for some + non-user-related reason. Counts towards instance downtime. + +* Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to + user initiated action or failure. We ignore this period for the uptime calculation purposes. + +This metric is recalculated over the last sampling period (last minute) to account for +any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the +sampling interval as well as adjacent REMOVED events. + +### Job Uptime + +*Percentage of the job instances considered to be in RUNNING state for the specified duration +relative to request time. This is a purely application side metric that is considering aggregate +uptime of all RUNNING instances. Any user- or platform initiated restarts directly affect +this metric.* + +**Collection scope:** We currently expose job uptime values at 5 pre-defined +percentiles (50th,75th,90th,95th and 99th): + +* `sla_<job_key>_job_uptime_50_00_sec` +* `sla_<job_key>_job_uptime_75_00_sec` +* `sla_<job_key>_job_uptime_90_00_sec` +* `sla_<job_key>_job_uptime_95_00_sec` +* `sla_<job_key>_job_uptime_99_00_sec` + +**Units:** seconds +You can also get customized real-time stats from aurora client. See `aurora sla -h` for +more details. + +### Median Time To Assigned (MTTA) + +*Median time a job spends waiting for its tasks to be assigned to a host. This is a combined +metric that helps track the dependency of scheduling performance on the requested resources +(user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform scope).* + +**Collection scope:** + +* Per job - `sla_<job_key>_mtta_ms` +* Per cluster - `sla_cluster_mtta_ms` +* Per instance size (small, medium, large, x-large, xx-large). Size are defined in: +[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.21.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java) + * By CPU: + * `sla_cpu_small_mtta_ms` + * `sla_cpu_medium_mtta_ms` + * `sla_cpu_large_mtta_ms` + * `sla_cpu_xlarge_mtta_ms` + * `sla_cpu_xxlarge_mtta_ms` + * By RAM: + * `sla_ram_small_mtta_ms` + * `sla_ram_medium_mtta_ms` + * `sla_ram_large_mtta_ms` + * `sla_ram_xlarge_mtta_ms` + * `sla_ram_xxlarge_mtta_ms` + * By DISK: + * `sla_disk_small_mtta_ms` + * `sla_disk_medium_mtta_ms` + * `sla_disk_large_mtta_ms` + * `sla_disk_xlarge_mtta_ms` + * `sla_disk_xxlarge_mtta_ms` + +**Units:** milliseconds + +MTTA only considers instances that have already reached ASSIGNED state and ignores those +that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource +constraints) do not affect metric curves. + +### Median Time To Starting (MTTS) + +*Median time a job waits for its tasks to reach STARTING state. 
This is a comprehensive metric +reflecting on the overall time it takes for the Aurora/Mesos to start initializing the sandbox +for a task.* + +**Collection scope:** + +* Per job - `sla_<job_key>_mtts_ms` +* Per cluster - `sla_cluster_mtts_ms` +* Per instance size (small, medium, large, x-large, xx-large). Size are defined in: +[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.21.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java) + * By CPU: + * `sla_cpu_small_mtts_ms` + * `sla_cpu_medium_mtts_ms` + * `sla_cpu_large_mtts_ms` + * `sla_cpu_xlarge_mtts_ms` + * `sla_cpu_xxlarge_mtts_ms` + * By RAM: + * `sla_ram_small_mtts_ms` + * `sla_ram_medium_mtts_ms` + * `sla_ram_large_mtts_ms` + * `sla_ram_xlarge_mtts_ms` + * `sla_ram_xxlarge_mtts_ms` + * By DISK: + * `sla_disk_small_mtts_ms` + * `sla_disk_medium_mtts_ms` + * `sla_disk_large_mtts_ms` + * `sla_disk_xlarge_mtts_ms` + * `sla_disk_xxlarge_mtts_ms` + +**Units:** milliseconds + +MTTS only considers instances in STARTING state. This ensures straggler instances (e.g. with +unreasonable resource constraints) do not affect metric curves. + +### Median Time To Running (MTTR) + +*Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric +reflecting on the overall time it takes for the Aurora/Mesos to start executing user content.* + +**Collection scope:** + +* Per job - `sla_<job_key>_mttr_ms` +* Per cluster - `sla_cluster_mttr_ms` +* Per instance size (small, medium, large, x-large, xx-large). Size are defined in: +[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.21.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java) + * By CPU: + * `sla_cpu_small_mttr_ms` + * `sla_cpu_medium_mttr_ms` + * `sla_cpu_large_mttr_ms` + * `sla_cpu_xlarge_mttr_ms` + * `sla_cpu_xxlarge_mttr_ms` + * By RAM: + * `sla_ram_small_mttr_ms` + * `sla_ram_medium_mttr_ms` + * `sla_ram_large_mttr_ms` + * `sla_ram_xlarge_mttr_ms` + * `sla_ram_xxlarge_mttr_ms` + * By DISK: + * `sla_disk_small_mttr_ms` + * `sla_disk_medium_mttr_ms` + * `sla_disk_large_mttr_ms` + * `sla_disk_xlarge_mttr_ms` + * `sla_disk_xxlarge_mttr_ms` + +**Units:** milliseconds + +MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with +unreasonable resource constraints) do not affect metric curves. + +## Limitations + +* The availability of Aurora SLA metrics is bound by the scheduler availability. + +* All metrics are calculated at a pre-defined interval (currently set at 1 minute). + Scheduler restarts may result in missed collections.
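+
+Since all of the counters above are exported as plain text on the scheduler `/vars` endpoint
+mentioned in the overview, they can also be scraped ad hoc with a few lines of code. The snippet
+below is only a sketch: it assumes the Vagrant scheduler address used earlier and simply prints
+every `sla_`-prefixed counter; point it at your own scheduler and filter for the job keys you
+care about.
+
+```python
+# Sketch: dump SLA counters from the scheduler /vars endpoint. Assumes the
+# Vagrant address used earlier; each /vars line has the form "<name> <value>".
+import urllib.request
+
+VARS_URL = 'http://192.168.33.7:8081/vars'
+
+with urllib.request.urlopen(VARS_URL) as resp:
+    for line in resp.read().decode('utf-8').splitlines():
+        if line.startswith('sla_'):
+            print(line)
+```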
Added: aurora/site/source/documentation/0.21.0/features/sla-requirements.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/features/sla-requirements.md?rev=1840515&view=auto
==============================================================================
--- aurora/site/source/documentation/0.21.0/features/sla-requirements.md (added)
+++ aurora/site/source/documentation/0.21.0/features/sla-requirements.md Tue Sep 11 05:28:10 2018
@@ -0,0 +1,185 @@
+SLA Requirements
+================
+
+- [Overview](#overview)
+- [Default SLA](#default-sla)
+- [Custom SLA](#custom-sla)
+  - [Count-based](#count-based)
+  - [Percentage-based](#percentage-based)
+  - [Coordinator-based](#coordinator-based)
+
+## Overview
+
+Aurora guarantees SLA requirements for jobs. These requirements limit the impact of cluster-wide
+maintenance operations on the jobs. For instance, when an operator upgrades
+the OS on all the Mesos agent machines, the tasks scheduled on them need to be drained.
+By specifying SLA requirements, a job can make sure that it keeps enough instances running to
+continue operating safely without incurring downtime.
+
+> The SLA is defined as the minimum number of active tasks required for a job over a duration window.
+A task is active if it was in the `RUNNING` state during the last duration window.
+
+There is a [default](#default-sla) SLA guarantee for
+[preferred](../../features/multitenancy/#configuration-tiers) tier jobs, and it is also possible to
+specify [custom](#custom-sla) SLA requirements.
+
+## Default SLA
+
+Aurora guarantees a default SLA requirement for tasks in the
+[preferred](../../features/multitenancy/#configuration-tiers) tier:
+
+> 95% of tasks in a job will be `active` over every 30-minute window.
+
+
+## Custom SLA
+
+For jobs that need different guarantees, Aurora allows them to specify their own
+SLA requirements via `SlaPolicies`. There are three different ways to express SLA requirements.
+
+### [Count-based](../../reference/configuration/#countslapolicy-objects)
+
+For jobs that need a minimum `number` of instances to be running all the time,
+[`CountSlaPolicy`](../../reference/configuration/#countslapolicy-objects)
+provides the ability to express the minimum number of required active instances (i.e. the number of
+tasks that are `RUNNING` for at least `duration_secs`). For instance, if we have a
+`replicated-service` with 3 instances that needs at least 2 of them `RUNNING` over every 30-minute
+window to be considered healthy, the SLA requirement can be expressed with a
+[`CountSlaPolicy`](../../reference/configuration/#countslapolicy-objects) as shown below:
+
+```python
+Job(
+  name = 'replicated-service',
+  role = 'www-data',
+  instances = 3,
+  sla_policy = CountSlaPolicy(
+    count = 2,
+    duration_secs = 1800
+  )
+  ...
+)
+```
+
+### [Percentage-based](../../reference/configuration/#percentageslapolicy-objects)
+
+For jobs that need a minimum `percentage` of instances to be running all the time,
+[`PercentageSlaPolicy`](../../reference/configuration/#percentageslapolicy-objects) provides the
+ability to express the minimum percentage of required active instances (i.e. the percentage of
+tasks that are `RUNNING` for at least `duration_secs`).
+For instance, if we have a `frontend-service` that runs 10000 instances to handle peak load and
+cannot tolerate more than 0.1% of its instances being down over any 1-hour window, the SLA
+requirement can be expressed with a
+[`PercentageSlaPolicy`](../../reference/configuration/#percentageslapolicy-objects) as shown below:
+
+```python
+Job(
+  name = 'frontend-service',
+  role = 'www-data',
+  instances = 10000,
+  sla_policy = PercentageSlaPolicy(
+    percentage = 99.9,
+    duration_secs = 3600
+  )
+  ...
+)
+```
+
+### [Coordinator-based](../../reference/configuration/#coordinatorslapolicy-objects)
+
+When neither of the above policies is expressive enough to describe the SLA requirements for a job,
+the SLA calculation can be off-loaded to a custom service called the `Coordinator`. The `Coordinator`
+needs to expose an endpoint that will be called to check whether removal of a task will affect the
+SLA requirements for the job. This is useful for controlling the number of tasks that undergo
+maintenance at a time without affecting the SLA for the application.
+
+Consider an example where a `storage-service` stores 2 replicas of each object. The replicas are
+distributed across the instances such that they are stored on different hosts. In addition, a
+consistent hash is used to distribute the data across the instances.
+
+When an instance needs to be drained (say, for host maintenance), we have to make sure that at least
+1 of the 2 replicas remains available. In such a case, a `Coordinator` service can be used to
+maintain the SLA guarantees required for the job.
+
+The job can be configured with a
+[`CoordinatorSlaPolicy`](../../reference/configuration/#coordinatorslapolicy-objects) to specify the
+coordinator endpoint and the field in the response JSON that indicates whether the SLA will be
+affected when the task is removed.
+
+```python
+Job(
+  name = 'storage-service',
+  role = 'www-data',
+  sla_policy = CoordinatorSlaPolicy(
+    coordinator_url = 'http://coordinator.example.com',
+    status_key = 'drain'
+  )
+  ...
+)
+```
+
+
+#### Coordinator Interface [Experimental]
+
+When a [`CoordinatorSlaPolicy`](../../reference/configuration/#coordinatorslapolicy-objects) is
+specified for a job, any action that requires removing a task
+(such as drains) must get approval from the `Coordinator` before proceeding. The
+coordinator service needs to expose an HTTP endpoint that takes a task-key param
+(`<cluster>/<role>/<env>/<name>/<instance>`) and a JSON body describing the task
+details, the force-maintenance countdown (in milliseconds) and other params, and returns a JSON
+response containing the boolean status that allows or disallows the task's removal.
+
+##### Request:
+```javascript
+POST /
+  ?task=<cluster>/<role>/<env>/<name>/<instance>
+
+{
+  "forceMaintenanceCountdownMs": "604755646",
+  "task": "cluster/role/devel/job/1",
+  "taskConfig": {
+    "assignedTask": {
+      "taskId": "taskA",
+      "slaveHost": "a",
+      "task": {
+        "job": {
+          "role": "role",
+          "environment": "devel",
+          "name": "job"
+        },
+        ...
+      },
+      "assignedPorts": {
+        "http": 1000
+      },
+      "instanceId": 1
+      ...
+    },
+    ...
+  }
+}
+```
+
+##### Response:
+```json
+{
+  "drain": true
+}
+```
+
+If the Coordinator allows removal of the task, the task's
+[termination lifecycle](../../reference/configuration/#httplifecycleconfig-objects)
+is triggered. If the Coordinator does not allow removal, the request will be retried later.
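+
+As an illustration, a minimal coordinator endpoint can be sketched in just a few lines. The example
+below is not part of Aurora: it assumes the Flask web framework purely for brevity, and the
+`already_draining()` helper, the drain threshold, and the port are hypothetical placeholders for
+whatever check your service actually needs.
+
+```python
+# Illustrative sketch only (not part of Aurora). Assumes Flask is installed;
+# the "SLA check" here is a stand-in for your service's real logic.
+from flask import Flask, request, jsonify
+
+app = Flask(__name__)
+MAX_CONCURRENT_DRAINS = 1  # hypothetical threshold
+
+
+def already_draining():
+    """Hypothetical helper: number of replicas currently down or draining."""
+    return 0
+
+
+@app.route('/', methods=['POST'])
+def drain_check():
+    task_key = request.args.get('task')           # <cluster>/<role>/<env>/<name>/<instance>
+    body = request.get_json(force=True) or {}     # task details, as in the Request example above
+    allow = already_draining() < MAX_CONCURRENT_DRAINS
+    app.logger.info('drain request for %s (countdown=%s ms): %s',
+                    task_key, body.get('forceMaintenanceCountdownMs'), allow)
+    # The response field name must match the status_key configured in CoordinatorSlaPolicy.
+    return jsonify({'drain': allow})
+
+
+if __name__ == '__main__':
+    app.run(port=8080)
+```
+
+Because the scheduler serializes calls to each coordinator endpoint (see the next section), the
+handler can base its decision on current task state alone, but it should still tolerate duplicate
+requests for the same task.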
+ +#### Coordinator Actions + +Coordinator endpoint get its own lock and this is used to serializes calls to the Coordinator. +It guarantees that only one concurrent request is sent to a coordinator endpoint. This allows +coordinators to simply look the current state of the tasks to determine its SLA (without having +to worry about in-flight and pending requests). However if there are multiple coordinators, +maintenance can be done in parallel across all the coordinators. + +_Note: Single concurrent request to a coordinator endpoint does not translate as exactly-once +guarantee. The coordinator must be able to handle duplicate drain +requests for the same task._ + + + Added: aurora/site/source/documentation/0.21.0/features/webhooks.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/features/webhooks.md?rev=1840515&view=auto ============================================================================== --- aurora/site/source/documentation/0.21.0/features/webhooks.md (added) +++ aurora/site/source/documentation/0.21.0/features/webhooks.md Tue Sep 11 05:28:10 2018 @@ -0,0 +1,112 @@ +Webhooks +======== + +Aurora has an optional feature which allows operator to specify a file to configure a HTTP webhook +to receive task state change events. It can be enabled with a scheduler flag eg +`-webhook_config=/path/to/webhook.json`. At this point, webhooks are still considered *experimental*. + +Below is a sample configuration: + +```json +{ + "headers": { + "Content-Type": "application/vnd.kafka.json.v1+json", + "Producer-Type": "reliable" + }, + "targetURL": "http://localhost:5000/", + "timeoutMsec": 5 +} +``` + +And an example of a response that you will get back: + +```json +{ + "task": + { + "cachedHashCode":0, + "assignedTask": { + "cachedHashCode":0, + "taskId":"vagrant-test-http_example-8-a6cf7ec5-d793-49c7-b10f-0e14ab80bfff", + "task": { + "cachedHashCode":-1819348376, + "job": { + "cachedHashCode":803049425, + "role":"vagrant", + "environment":"test", + "name":"http_example" + }, + "owner": { + "cachedHashCode":226895216, + "user":"vagrant" + }, + "isService":true, + "numCpus":0.1, + "ramMb":16, + "diskMb":8, + "priority":0, + "maxTaskFailures":1, + "production":false, + "resources":[ + {"cachedHashCode":729800451,"setField":"NUM_CPUS","value":0.1}, + {"cachedHashCode":552899914,"setField":"RAM_MB","value":16}, + {"cachedHashCode":-1547868317,"setField":"DISK_MB","value":8}, + {"cachedHashCode":1957328227,"setField":"NAMED_PORT","value":"http"}, + {"cachedHashCode":1954229436,"setField":"NAMED_PORT","value":"tcp"} + ], + "constraints":[], + "requestedPorts":["http","tcp"], + "taskLinks":{"http":"http://%host%:%port:http%"}, + "contactEmail":"vagrant@localhost", + "executorConfig": { + "cachedHashCode":-1194797325, + "name":"AuroraExecutor", + "data": "{\"environment\": \"test\", \"health_check_config\": {\"initial_interval_secs\": 5.0, \"health_checker\": { \"http\": {\"expected_response_code\": 0, \"endpoint\": \"/health\", \"expected_response\": \"ok\"}}, \"max_consecutive_failures\": 0, \"timeout_secs\": 1.0, \"interval_secs\": 1.0}, \"name\": \"http_example\", \"service\": true, \"max_task_failures\": 1, \"cron_collision_policy\": \"KILL_EXISTING\", \"enable_hooks\": false, \"cluster\": \"devcluster\", \"task\": {\"processes\": [{\"daemon\": false, \"name\": \"echo_ports\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"echo \\\"tcp port: {{thermos.ports[tcp]}}; http port: {{thermos.ports[http]}}; alias: 
{{thermos.ports[alias]}}\\\"\", \"final\": false}, {\"daemon\": false, \"name\": \"stage_server\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"cp /vagrant/src/test/sh/org/apache/aurora/e2e/http_example.py .\", \"final\": false}, {\ "daemon\": false, \"name\": \"run_server\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"python http_example.py {{thermos.ports[http]}}\", \"final\": false}], \"name\": \"http_example\", \"finalization_wait\": 30, \"max_failures\": 1, \"max_concurrency\": 0, \"resources\": {\"disk\": 8388608, \"ram\": 16777216, \"cpu\": 0.1}, \"constraints\": [{\"order\": [\"echo_ports\", \"stage_server\", \"run_server\"]}]}, \"production\": false, \"role\": \"vagrant\", \"contact\": \"vagrant@localhost\", \"announce\": {\"primary_port\": \"http\", \"portmap\": {\"alias\": \"http\"}}, \"lifecycle\": {\"http\": {\"graceful_shutdown_endpoint\": \"/quitquitquit\", \"port\": \"health\", \"shutdown_endpoint\": \"/abortabortabort\"}}, \"priority\": 0}"}, + "metadata":[], + "container":{ + "cachedHashCode":-1955376216, + "setField":"MESOS", + "value":{"cachedHashCode":31}} + }, + "assignedPorts":{}, + "instanceId":8 + }, + "status":"PENDING", + "failureCount":0, + "taskEvents":[ + {"cachedHashCode":0,"timestamp":1464992060258,"status":"PENDING","scheduler":"aurora"}] + }, + "oldState":{}} +``` + +By default, the webhook watches all TaskStateChanges and sends events to configured endpoint. If you +are only interested in certain types of TaskStateChange (transition to `LOST` or `FAILED` statuses), +you can specify a whitelist of the desired task statuses in webhook.json. The webhook will only send +the corresponding events for the whitelisted statuses to the configured endpoint. + +```json +{ + "headers": { + "Content-Type": "application/vnd.kafka.json.v1+json", + "Producer-Type": "reliable" + }, + "targetURL": "http://localhost:5000/", + "timeoutMsec": 50, + "statuses": ["LOST", "FAILED"] +} +``` + +If you want to whitelist all TaskStateChanges, you can add a wildcard character `*` to your whitelist +like below, or simply leave out the `statuses` field in webhook.json. + +```json +{ + "headers": { + "Content-Type": "application/vnd.kafka.json.v1+json", + "Producer-Type": "reliable" + }, + "targetURL": "http://localhost:5000/", + "timeoutMsec": 50, + "statuses": ["*"] +} +``` Added: aurora/site/source/documentation/0.21.0/getting-started/overview.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/getting-started/overview.md?rev=1840515&view=auto ============================================================================== --- aurora/site/source/documentation/0.21.0/getting-started/overview.md (added) +++ aurora/site/source/documentation/0.21.0/getting-started/overview.md Tue Sep 11 05:28:10 2018 @@ -0,0 +1,112 @@ +Aurora System Overview +====================== + +Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run +long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos' scalability, +fault-tolerance, and resource isolation. + + +Components +---------- + +It is important to have an understanding of the components that make up +a functioning Aurora cluster. + + + +* **Aurora scheduler** + The scheduler is your primary interface to the work you run in your cluster. You will + instruct it to run jobs, and it will manage them in Mesos for you. 
You will also frequently use
+  the scheduler's read-only web interface as a heads-up display for what's running in your cluster.
+
+* **Aurora client**
+  The client (`aurora` command) is a command line tool that exposes primitives that you can use to
+  interact with the scheduler. The client operates on jobs, rather than on the individual tasks
+  that make them up.
+
+  Aurora also provides an admin client (`aurora_admin` command) that contains commands built for
+  cluster administrators. You can use this tool to do things like manage user quotas and manage
+  graceful maintenance on machines in the cluster.
+
+* **Aurora executor**
+  The executor (a.k.a. Thermos executor) is responsible for carrying out the workloads described in
+  the Aurora DSL (`.aurora` files). The executor is what actually executes user processes. It will
+  also perform health checking of tasks and register tasks in ZooKeeper for the purposes of dynamic
+  service discovery.
+
+* **Aurora observer**
+  The observer provides browser-based access to the status of individual tasks executing on worker
+  machines. It gives insight into the processes executing, and facilitates browsing of task sandbox
+  directories.
+
+* **ZooKeeper**
+  [ZooKeeper](http://zookeeper.apache.org) is a distributed consensus system. In an Aurora cluster
+  it is used for reliable election of the leading Aurora scheduler and Mesos master. It is also
+  used as a vehicle for service discovery; see [Service Discovery](../../features/service-discovery/).
+
+* **Mesos master**
+  The master is responsible for tracking worker machines and performing accounting of their
+  resources. The scheduler interfaces with the master to control the cluster.
+
+* **Mesos agent**
+  The agent receives work assigned by the scheduler and executes it. It interfaces with Linux
+  isolation systems like cgroups, namespaces and Docker to manage the resource consumption of tasks.
+  When a user task is launched, the agent will launch the executor (in the context of a Linux cgroup
+  or Docker container depending upon the environment), which will in turn fork user processes.
+
+  In earlier versions of Mesos and Aurora, the Mesos agent was known as the Mesos slave.
+
+
+Jobs, Tasks and Processes
+--------------------------
+
+Aurora is a Mesos framework used to schedule *jobs* onto Mesos. Mesos
+cares about individual *tasks*, but typical jobs consist of dozens or
+hundreds of task replicas. Aurora provides a layer on top of Mesos with
+its `Job` abstraction. An Aurora `Job` consists of a task template and
+instructions for creating near-identical replicas of that task (modulo
+things like "instance id" or specific port numbers which may differ from
+machine to machine).
+
+How many tasks make up a Job varies. At a basic level, a Job consists of
+one task template and instructions for creating near-identical replicas of that task
+(otherwise referred to as "instances" or "shards").
+
+A task can merely be a single *process* corresponding to a single
+command line, such as `python2.7 my_script.py`. However, a task can also
+consist of many separate processes, which all run within a single
+sandbox. For example, a task might run several cooperating processes together,
+such as `logrotate`, `installer`, master, or agent processes. This is
+where Thermos comes in. While Aurora provides a `Job` abstraction on
+top of Mesos `Tasks`, Thermos provides a `Process` abstraction
+underneath Mesos `Task`s and serves as part of the Aurora framework's
+executor.
+
+You define `Job`s, `Task`s, and `Process`es in a configuration file.
+Configuration files are written in Python, and make use of the +[Pystachio](https://github.com/wickman/pystachio) templating language, +along with specific Aurora, Mesos, and Thermos commands and methods. +The configuration files typically end with a `.aurora` extension. + +Summary: + +* Aurora manages jobs made of tasks. +* Mesos manages tasks made of processes. +* Thermos manages processes. +* All that is defined in `.aurora` configuration files + + + +Each `Task` has a *sandbox* created when the `Task` starts and garbage +collected when it finishes. All of a `Task'`s processes run in its +sandbox, so processes can share state by using a shared current working +directory. + +The sandbox garbage collection policy considers many factors, most +importantly age and size. It makes a best-effort attempt to keep +sandboxes around as long as possible post-task in order for service +owners to inspect data and logs, should the `Task` have completed +abnormally. But you can't design your applications assuming sandboxes +will be around forever, e.g. by building log saving or other +checkpointing mechanisms directly into your application or into your +`Job` description. + Added: aurora/site/source/documentation/0.21.0/getting-started/tutorial.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/getting-started/tutorial.md?rev=1840515&view=auto ============================================================================== --- aurora/site/source/documentation/0.21.0/getting-started/tutorial.md (added) +++ aurora/site/source/documentation/0.21.0/getting-started/tutorial.md Tue Sep 11 05:28:10 2018 @@ -0,0 +1,258 @@ +# Aurora Tutorial + +This tutorial shows how to use the Aurora scheduler to run (and "`printf-debug`") +a hello world program on Mesos. This is the recommended document for new Aurora users +to start getting up to speed on the system. + +- [Prerequisite](#setup-install-aurora) +- [The Script](#the-script) +- [Aurora Configuration](#aurora-configuration) +- [Creating the Job](#creating-the-job) +- [Watching the Job Run](#watching-the-job-run) +- [Cleanup](#cleanup) +- [Next Steps](#next-steps) + + +## Prerequisite + +This tutorial assumes you are running [Aurora locally using Vagrant](../vagrant/). +However, in general the instructions are also applicable to any other +[Aurora installation](../../operations/installation/). + +Unless otherwise stated, all commands are to be run from the root of the aurora +repository clone. + + +## The Script + +Our "hello world" application is a simple Python script that loops +forever, displaying the time every few seconds. Copy the code below and +put it in a file named `hello_world.py` in the root of your Aurora repository clone +(Note: this directory is the same as `/vagrant` inside the Vagrant VMs). + +The script has an intentional bug, which we will explain later on. + +<!-- NOTE: If you are changing this file, be sure to also update examples/vagrant/test_tutorial.sh. +--> +```python +import time + +def main(): + SLEEP_DELAY = 10 + # Python experts - ignore this blatant bug. + for i in xrang(100): + print("Hello world! The time is now: %s. Sleeping for %d secs" % ( + time.asctime(), SLEEP_DELAY)) + time.sleep(SLEEP_DELAY) + +if __name__ == "__main__": + main() +``` + +## Aurora Configuration + +Once we have our script/program, we need to create a *configuration +file* that tells Aurora how to manage and launch our Job. Save the below +code in the file `hello_world.aurora`. 
+ +<!-- NOTE: If you are changing this file, be sure to also update examples/vagrant/test_tutorial.sh. +--> +```python +pkg_path = '/vagrant/hello_world.py' + +# we use a trick here to make the configuration change with +# the contents of the file, for simplicity. in a normal setting, packages would be +# versioned, and the version number would be changed in the configuration. +import hashlib +with open(pkg_path, 'rb') as f: + pkg_checksum = hashlib.md5(f.read()).hexdigest() + +# copy hello_world.py into the local sandbox +install = Process( + name = 'fetch_package', + cmdline = 'cp %s . && echo %s && chmod +x hello_world.py' % (pkg_path, pkg_checksum)) + +# run the script +hello_world = Process( + name = 'hello_world', + cmdline = 'python -u hello_world.py') + +# describe the task +hello_world_task = SequentialTask( + processes = [install, hello_world], + resources = Resources(cpu = 1, ram = 1*MB, disk=8*MB)) + +jobs = [ + Service(cluster = 'devcluster', + environment = 'devel', + role = 'www-data', + name = 'hello_world', + task = hello_world_task) +] +``` + +There is a lot going on in that configuration file: + +1. From a "big picture" viewpoint, it first defines two +Processes. Then it defines a Task that runs the two Processes in the +order specified in the Task definition, as well as specifying what +computational and memory resources are available for them. Finally, +it defines a Job that will schedule the Task on available and suitable +machines. This Job is the sole member of a list of Jobs; you can +specify more than one Job in a config file. + +2. At the Process level, it specifies how to get your code into the +local sandbox in which it will run. It then specifies how the code is +actually run once the second Process starts. + +For more about Aurora configuration files, see the [Configuration +Tutorial](../../reference/configuration-tutorial/) and the [Configuration +Reference](../../reference/configuration/) (preferably after finishing this +tutorial). + + +## Creating the Job + +We're ready to launch our job! To do so, we use the Aurora Client to +issue a Job creation request to the Aurora scheduler. + +Many Aurora Client commands take a *job key* argument, which uniquely +identifies a Job. A job key consists of four parts, each separated by a +"/". The four parts are `<cluster>/<role>/<environment>/<jobname>` +in that order: + +* Cluster refers to the name of a particular Aurora installation. +* Role names are user accounts existing on the agent machines. If you +don't know what accounts are available, contact your sysadmin. +* Environment names are namespaces; you can count on `test`, `devel`, +`staging` and `prod` existing. +* Jobname is the custom name of your job. + +When comparing two job keys, if any of the four parts is different from +its counterpart in the other key, then the two job keys identify two separate +jobs. If all four values are identical, the job keys identify the same job. + +The `clusters.json` [client configuration](../../reference/client-cluster-configuration/) +for the Aurora scheduler defines the available cluster names. +For Vagrant, from the top-level of your Aurora repository clone, do: + + $ vagrant ssh + +Followed by: + + vagrant@aurora:~$ cat /etc/aurora/clusters.json + +You'll see something like the following. The `name` value shown here, corresponds to a job key's cluster value. 
+ +```javascript +[{ + "name": "devcluster", + "zk": "192.168.33.7", + "scheduler_zk_path": "/aurora/scheduler", + "auth_mechanism": "UNAUTHENTICATED", + "slave_run_directory": "latest", + "slave_root": "/var/lib/mesos" +}] +``` + +The Aurora Client command that actually runs our Job is `aurora job create`. It creates a Job as +specified by its job key and configuration file arguments and runs it. + + aurora job create <cluster>/<role>/<environment>/<jobname> <config_file> + +Or for our example: + + aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora + +After entering our virtual machine using `vagrant ssh`, this returns: + + vagrant@aurora:~$ aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora + INFO] Creating job hello_world + INFO] Checking status of devcluster/www-data/devel/hello_world + Job create succeeded: job url=http://aurora.local:8081/scheduler/www-data/devel/hello_world + + +## Watching the Job Run + +Now that our job is running, let's see what it's doing. Access the +scheduler web interface at `http://$scheduler_hostname:$scheduler_port/scheduler` +Or when using `vagrant`, `http://192.168.33.7:8081/scheduler` +First we see what Jobs are scheduled: + + + +Click on your user name, which in this case was `www-data`, and we see the Jobs associated +with that role: + + + +If you click on your `hello_world` Job, you'll see: + + + +Oops, looks like our first job didn't quite work! The task is temporarily throttled for +having failed on every attempt of the Aurora scheduler to run it. We have to figure out +what is going wrong. + +On the Completed tasks tab, we see all past attempts of the Aurora scheduler to run our job. + + + +We can navigate to the Task page of a failed run by clicking on the host link. + + + +Once there, we see that the `hello_world` process failed. The Task page +captures the standard error and standard output streams and makes them available. +Clicking through to `stderr` on the failed `hello_world` process, we see what happened. + + + +It looks like we made a typo in our Python script. We wanted `xrange`, +not `xrang`. Edit the `hello_world.py` script to use the correct function +and save it as `hello_world_v2.py`. Then update the `hello_world.aurora` +configuration to the newest version. + +In order to try again, we can now instruct the scheduler to update our job: + + vagrant@aurora:~$ aurora update start devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora + INFO] Starting update for: hello_world + Job update has started. View your update progress at http://aurora.local:8081/scheduler/www-data/devel/hello_world/update/8ef38017-e60f-400d-a2f2-b5a8b724e95b + +This time, the task comes up. + + + +By again clicking on the host, we inspect the Task page, and see that the +`hello_world` process is running. + + + +We then inspect the output by clicking on `stdout` and see our process' +output: + + + +## Cleanup + +Now that we're done, we kill the job using the Aurora client: + + vagrant@aurora:~$ aurora job killall devcluster/www-data/devel/hello_world + INFO] Killing tasks for job: devcluster/www-data/devel/hello_world + INFO] Instances to be killed: [0] + Successfully killed instances [0] + Job killall succeeded + +The job page now shows the `hello_world` tasks as completed. 
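+
+To double-check from the command line, you can also ask the scheduler for the job's status; the
+`aurora job status` subcommand lists the tasks the scheduler knows about for a job key (the exact
+output format varies between client versions):
+
+    vagrant@aurora:~$ aurora job status devcluster/www-data/devel/hello_world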
+ + + +## Next Steps + +Now that you've finished this Tutorial, you should read or do the following: + +- [The Aurora Configuration Tutorial](../../reference/configuration-tutorial/), which provides more examples + and best practices for writing Aurora configurations. You should also look at + the [Aurora Configuration Reference](../../reference/configuration/). +- Explore the Aurora Client - use `aurora -h`, and read the + [Aurora Client Commands](../../reference/client-commands/) document. Added: aurora/site/source/documentation/0.21.0/getting-started/vagrant.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/getting-started/vagrant.md?rev=1840515&view=auto ============================================================================== --- aurora/site/source/documentation/0.21.0/getting-started/vagrant.md (added) +++ aurora/site/source/documentation/0.21.0/getting-started/vagrant.md Tue Sep 11 05:28:10 2018 @@ -0,0 +1,154 @@ +A local Cluster with Vagrant +============================ + +This document shows you how to configure a complete cluster using a virtual machine. This setup +replicates a real cluster in your development machine as closely as possible. After you complete +the steps outlined here, you will be ready to create and run your first Aurora job. + +The following sections describe these steps in detail: + +1. [Overview](#overview) +1. [Install VirtualBox and Vagrant](#install-virtualbox-and-vagrant) +1. [Clone the Aurora repository](#clone-the-aurora-repository) +1. [Start the local cluster](#start-the-local-cluster) +1. [Log onto the VM](#log-onto-the-vm) +1. [Run your first job](#run-your-first-job) +1. [Rebuild components](#rebuild-components) +1. [Shut down or delete your local cluster](#shut-down-or-delete-your-local-cluster) +1. [Troubleshooting](#troubleshooting) + + +Overview +-------- + +The Aurora distribution includes a set of scripts that enable you to create a local cluster in +your development machine. These scripts use [Vagrant](https://www.vagrantup.com/) and +[VirtualBox](https://www.virtualbox.org/) to run and configure a virtual machine. Once the +virtual machine is running, the scripts install and initialize Aurora and any required components +to create the local cluster. + + +Install VirtualBox and Vagrant +------------------------------ + +First, download and install [VirtualBox](https://www.virtualbox.org/) on your development machine. + +Then download and install [Vagrant](https://www.vagrantup.com/). To verify that the installation +was successful, open a terminal window and type the `vagrant` command. You should see a list of +common commands for this tool. + + +Clone the Aurora repository +--------------------------- + +To obtain the Aurora source distribution, clone its Git repository using the following command: + + git clone git://git.apache.org/aurora.git + + +Start the local cluster +----------------------- + +Now change into the `aurora/` directory, which contains the Aurora source code and +other scripts and tools: + + cd aurora/ + +To start the local cluster, type the following command: + + vagrant up + +This command uses the configuration scripts in the Aurora distribution to: + +* Download a Linux system image. +* Start a virtual machine (VM) and configure it. +* Install the required build tools on the VM. +* Install Aurora's requirements (like [Mesos](http://mesos.apache.org/) and +[Zookeeper](http://zookeeper.apache.org/)) on the VM. +* Build and install Aurora from source on the VM. 
+* Start Aurora's services on the VM.
+
+This process takes several minutes to complete.
+
+You may notice a warning that guest additions in the VM don't match your version of VirtualBox.
+This should generally be harmless, but you may wish to install a vagrant plugin to take care of
+mismatches like this for you:
+
+    vagrant plugin install vagrant-vbguest
+
+With this plugin installed, whenever you `vagrant up`, the plugin will upgrade the guest additions
+for you when a version mismatch is detected. You can read more about the plugin
+[here](https://github.com/dotless-de/vagrant-vbguest).
+
+To verify that Aurora is running on the cluster, visit the following URLs:
+
+* Scheduler - http://192.168.33.7:8081
+* Observer - http://192.168.33.7:1338
+* Mesos Master - http://192.168.33.7:5050
+* Mesos Agent - http://192.168.33.7:5051
+
+
+Log onto the VM
+---------------
+
+To SSH into the VM, run the following command on your development machine:
+
+    vagrant ssh
+
+To verify that Aurora is installed in the VM, type the `aurora` command. You should see a list
+of arguments and possible commands.
+
+The `/vagrant` directory on the VM is mapped to the `aurora/` local directory
+from which you started the cluster. You can edit files inside this directory on your development
+machine and access them from the VM under `/vagrant`.
+
+A pre-installed `clusters.json` file refers to your local cluster as `devcluster`, which you
+will use in client commands.
+
+
+Run your first job
+------------------
+
+Now that your cluster is up and running, you are ready to define and run your first job in Aurora.
+For more information, see the [Aurora Tutorial](../tutorial/).
+
+
+Rebuild components
+------------------
+
+If you are changing Aurora code and would like to rebuild a component, you can use the `aurorabuild`
+command on the VM to build and restart a component. This is considerably faster than destroying
+and rebuilding your VM.
+
+`aurorabuild` accepts a list of components to build and update, and prints the supported components
+when invoked with no arguments. For example, to rebuild and restart the client:
+
+    vagrant ssh -c 'aurorabuild client'
+
+
+Shut down or delete your local cluster
+--------------------------------------
+
+To shut down your local cluster, run the `vagrant halt` command on your development machine. To
+start it again, run the `vagrant up` command.
+
+Once you are finished with your local cluster, or if you would otherwise like to start from scratch,
+you can use the command `vagrant destroy` to turn off and delete the virtual file system.
+ + +Troubleshooting +--------------- + +Most of the Vagrant related problems can be fixed by the following steps: + +* Destroying the vagrant environment with `vagrant destroy` +* Killing any orphaned VMs (see AURORA-499) with `virtualbox` UI or `VBoxManage` command line tool +* Cleaning the repository of build artifacts and other intermediate output with `git clean -fdx` +* Bringing up the vagrant environment with `vagrant up` + +If that still doesn't solve your problem, make sure to inspect the log files: + +* Scheduler: `/var/log/aurora/scheduler.log` or `sudo journalctl -u aurora-scheduler` +* Observer: `/var/log/thermos/observer.log` or `sudo journalctl -u thermos-observer` +* Mesos Master: `/var/log/mesos/mesos-master.INFO` (also see `.WARNING` and `.ERROR`) +* Mesos Agent: `/var/log/mesos/mesos-slave.INFO` (also see `.WARNING` and `.ERROR`) Added: aurora/site/source/documentation/0.21.0/images/CPUavailability.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/CPUavailability.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/CPUavailability.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/CompletedTasks.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/CompletedTasks.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/CompletedTasks.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/HelloWorldJob.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/HelloWorldJob.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/HelloWorldJob.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/RoleJobs.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/RoleJobs.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/RoleJobs.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/RunningJob.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/RunningJob.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. 
Propchange: aurora/site/source/documentation/0.21.0/images/RunningJob.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/ScheduledJobs.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/ScheduledJobs.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/ScheduledJobs.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/TaskBreakdown.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/TaskBreakdown.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/TaskBreakdown.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/aurora_hierarchy.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/aurora_hierarchy.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/aurora_hierarchy.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/aurora_logo.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/aurora_logo.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/aurora_logo.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/components.odg URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/components.odg?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/components.odg ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/components.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/components.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/components.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/debug-client-test.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/debug-client-test.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. 
Propchange: aurora/site/source/documentation/0.21.0/images/debug-client-test.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/debugging-client-test.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/debugging-client-test.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/debugging-client-test.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/killedtask.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/killedtask.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/killedtask.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/lifeofatask.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/lifeofatask.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/lifeofatask.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. 
Propchange: aurora/site/source/documentation/0.21.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/02_28_2015_apache_aurora_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/02_28_2015_apache_aurora_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/02_28_2015_apache_aurora_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/08_21_2014_past_present_future_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/08_21_2014_past_present_future_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. 
Propchange: aurora/site/source/documentation/0.21.0/images/presentations/08_21_2014_past_present_future_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/runningtask.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/runningtask.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. 
Propchange: aurora/site/source/documentation/0.21.0/images/runningtask.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/stderr.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/stderr.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/stderr.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/stdout.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/stdout.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/stdout.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/images/storage_hierarchy.png URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/images/storage_hierarchy.png?rev=1840515&view=auto ============================================================================== Binary file - no diff available. Propchange: aurora/site/source/documentation/0.21.0/images/storage_hierarchy.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: aurora/site/source/documentation/0.21.0/index.html.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/index.html.md?rev=1840515&view=auto ============================================================================== --- aurora/site/source/documentation/0.21.0/index.html.md (added) +++ aurora/site/source/documentation/0.21.0/index.html.md Tue Sep 11 05:28:10 2018 @@ -0,0 +1,80 @@ +## Introduction + +Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run +long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos' scalability, +fault-tolerance, and resource isolation. + +We encourage you to ask questions on the [Aurora user list](http://aurora.apache.org/community/) or +the `#aurora` IRC channel on `irc.freenode.net`. + + +## Getting Started +Information for everyone new to Apache Aurora. + + * [Aurora System Overview](getting-started/overview/) + * [Hello World Tutorial](getting-started/tutorial/) + * [Local cluster with Vagrant](getting-started/vagrant/) + +## Features +Description of important Aurora features. + + * [Containers](features/containers/) + * [Cron Jobs](features/cron-jobs/) + * [Custom Executors](features/custom-executors/) + * [Job Updates](features/job-updates/) + * [Multitenancy](features/multitenancy/) + * [Resource Isolation](features/resource-isolation/) + * [Scheduling Constraints](features/constraints/) + * [Services](features/services/) + * [Service Discovery](features/service-discovery/) + * [SLA Metrics](features/sla-metrics/) + * [SLA Requirements](features/sla-requirements/) + * [Webhooks](features/webhooks/) + +## Operators +For those that wish to manage and fine-tune an Aurora cluster. 
+
+ * [Installation](operations/installation/)
+ * [Configuration](operations/configuration/)
+ * [Upgrades](operations/upgrades/)
+ * [Troubleshooting](operations/troubleshooting/)
+ * [Monitoring](operations/monitoring/)
+ * [Security](operations/security/)
+ * [Storage](operations/storage/)
+ * [Backup](operations/backup-restore/)
+
+## Reference
+The complete reference of commands, configuration options, and scheduler internals.
+
+ * [Task lifecycle](reference/task-lifecycle/)
+ * Configuration (`.aurora` files)
+    - [Configuration Reference](reference/configuration/)
+    - [Configuration Tutorial](reference/configuration-tutorial/)
+    - [Configuration Best Practices](reference/configuration-best-practices/)
+    - [Configuration Templating](reference/configuration-templating/)
+ * Aurora Client
+    - [Client Commands](reference/client-commands/)
+    - [Client Hooks](reference/client-hooks/)
+    - [Client Cluster Configuration](reference/client-cluster-configuration/)
+ * [Scheduler Configuration](reference/scheduler-configuration/)
+ * [Observer Configuration](reference/observer-configuration/)
+ * [Endpoints](reference/scheduler-endpoints/)
+
+## Additional Resources
+ * [Tools integrating with Aurora](additional-resources/tools/)
+ * [Presentation videos and slides](additional-resources/presentations/)
+
+## Developers
+All the information you need to start modifying Aurora and contributing back to the project.
+
+ * [Contributing to the project](contributing/)
+ * [Committer's Guide](development/committers-guide/)
+ * [Design Documents](development/design-documents/)
+ * Developing the Aurora components:
+    - [Client](development/client/)
+    - [Scheduler](development/scheduler/)
+    - [Scheduler UI](development/ui/)
+    - [Thermos](development/thermos/)
+    - [Thrift structures](development/thrift/)
+
+
Added: aurora/site/source/documentation/0.21.0/operations/backup-restore.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/operations/backup-restore.md?rev=1840515&view=auto
==============================================================================
--- aurora/site/source/documentation/0.21.0/operations/backup-restore.md (added)
+++ aurora/site/source/documentation/0.21.0/operations/backup-restore.md Tue Sep 11 05:28:10 2018
@@ -0,0 +1,80 @@
+# Recovering from a Scheduler Backup
+
+**Be sure to read the entire page before attempting to restore from a backup, as it may have
+unintended consequences.**
+
+## Summary
+
+The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an
+earlier, backed-up version and requires all schedulers to be taken down temporarily while
+restoring. Once completed, the scheduler state resets to what it was when the backup was created.
+This means any jobs/tasks created or updated after the backup are unknown to the scheduler and will
+be killed shortly after the cluster restarts. All other tasks continue operating as normal.
+
+Usually, it is a bad idea to restore a backup that is not extremely recent (i.e. older than a few
+hours). This is because the scheduler will expect the cluster to look exactly as the backup does,
+so any tasks that have been rescheduled since the backup was taken will be killed.
+
+The instructions below have been verified in the [Vagrant environment](../../getting-started/vagrant/) and, with minor
+syntax/path changes, should be applicable to any Aurora cluster.
+
+Follow these steps to prepare the cluster for restoring from a backup:
+
+## Preparation
+
+* Stop all scheduler instances.
+
+* Pick a backup to use for rehydrating the mesos-replicated log. Backups can be found in the
+directory given to the scheduler as the `-backup_dir` argument. Backups are stored in the format
+`scheduler-backup-<yyyy-MM-dd-HH-mm>`.
+
+* If running the Aurora Scheduler in HA mode, pick a single scheduler instance to rehydrate.
+
+* Locate the `recovery-tool` in your setup. If Aurora was installed using a Debian package
+generated by our `aurora-packaging` script, the recovery tool can be found
+in `/usr/share/aurora/bin/recovery-tool`.
+
+## Cleanup
+
+* Delete (or move) the Mesos replicated log path for each scheduler instance. The location of the
+Mesos replicated log file path can be found by looking at the value given to the flag
+`-native_log_file_path` for each instance.
+
+* Initialize the Mesos replicated log files using the mesos-log tool:
+```
+sudo -u <USER> mesos-log initialize --path=<native_log_file_path>
+```
+Where `USER` is the user under which the scheduler instance will be run. For installations using
+Debian packages, the default user will be `aurora`. You may alternatively choose to specify
+a group as well by passing the `-g <GROUP>` option to `sudo`.
+Note that if the user under which the Aurora scheduler instance is run _does not_ have permissions
+to read this directory and the files it contains, the instance will fail to start.
+
+## Restore from backup
+
+* Run the `recovery-tool`. Wherever the flags match those used for the scheduler instance,
+use the same values:
+```
+$ recovery-tool -from BACKUP \
+-to LOG \
+-backup=<selected_backup_location> \
+-native_log_zk_group_path=<native_log_zk_group_path> \
+-native_log_file_path=<native_log_file_path> \
+-zk_endpoints=<zk_endpoints>
+```
+
+## Bring scheduler instances back online
+
+### If running in HA Mode
+
+* Start the rehydrated scheduler instance along with enough cleaned-up instances to
+meet the `-native_log_quorum_size`. The mesos-replicated log algorithm will replenish
+the "blank" scheduler instances with the information from the rehydrated instance.
+
+* Start any remaining scheduler instances.
+
+### If running in singleton mode
+
+* Start the single scheduler instance.
+
+
Added: aurora/site/source/documentation/0.21.0/operations/configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.21.0/operations/configuration.md?rev=1840515&view=auto
==============================================================================
--- aurora/site/source/documentation/0.21.0/operations/configuration.md (added)
+++ aurora/site/source/documentation/0.21.0/operations/configuration.md Tue Sep 11 05:28:10 2018
@@ -0,0 +1,380 @@
+# Scheduler Configuration
+
+The Aurora scheduler can take a variety of configuration options through command-line arguments.
+Examples are available under `examples/scheduler/`. For a list of available Aurora flags and their
+documentation, see [Scheduler Configuration Reference](../../reference/scheduler-configuration/).
+
+
+## A Note on Configuration
+Like Mesos, Aurora uses command-line flags for runtime configuration. As such, the Aurora
+"configuration file" is typically a `scheduler.sh` shell script of the following form:
+
+    #!/bin/bash
+    AURORA_HOME=/usr/local/aurora-scheduler
+
+    # Flags controlling the JVM.
+    JAVA_OPTS=(
+      -Xmx2g
+      -Xms2g
+      # GC tuning, etc.
+    )
+
+    # Flags controlling the scheduler.
+    AURORA_FLAGS=(
+      # Port for client RPCs and the web UI
+      -http_port=8081
+      # Log configuration, etc.
+    )
+
+    # Environment variables controlling libmesos
+    export JAVA_HOME=...
+    export GLOG_v=1
+    export LIBPROCESS_PORT=8083
+    export LIBPROCESS_IP=192.168.33.7
+
+    JAVA_OPTS="${JAVA_OPTS[*]}" exec "$AURORA_HOME/bin/aurora-scheduler" "${AURORA_FLAGS[@]}"
+
+That way Aurora's current flags are visible in `ps` and in the `/vars` admin endpoint.
+
+
+## JVM Configuration
+
+JVM settings are dependent on your environment and cluster size. They might require
+custom tuning. As a starting point, we recommend:
+
+* Ensure the initial (`-Xms`) and maximum (`-Xmx`) heap sizes are identical to prevent heap resizing
+  at runtime.
+* Either `-XX:+UseConcMarkSweepGC` or `-XX:+UseG1GC -XX:+UseStringDeduplication` is a
+  sane default for the garbage collector.
+* `-Djava.net.preferIPv4Stack=true` makes sense in most cases as well.
+
+
+## Network Configuration
+
+By default, Aurora binds to all interfaces and auto-discovers its hostname. To reduce ambiguity,
+it helps to hardcode them:
+
+    -http_port=8081
+    -ip=192.168.33.7
+    -hostname="aurora1.us-east1.example.org"
+
+Two environment variables control the IP and port used for communication with the Mesos master
+and for the replicated log used by Aurora:
+
+    export LIBPROCESS_PORT=8083
+    export LIBPROCESS_IP=192.168.33.7
+
+It is important that these can be reached from all Mesos master and Aurora scheduler instances.
+
+
+## Replicated Log Configuration
+
+Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is
+leader at a given time - the other schedulers follow log writes and prepare to take over as leader
+but do not communicate with the Mesos master. Either 3 or 5 schedulers are recommended in a
+production deployment depending on failure tolerance, and they must have persistent storage.
+
+Below is a summary of scheduler storage configuration flags that either don't have default values
+or require attention before deploying in a production environment.
+
+### `-native_log_quorum_size`
+Defines the Mesos replicated log quorum size. In a cluster with `N` schedulers, the flag
+`-native_log_quorum_size` should be set to `floor(N/2) + 1`. So in a cluster with 1 scheduler
+it should be set to `1`, in a cluster with 3 it should be set to `2`, and in a cluster of 5 it
+should be set to `3`.
+
+  Number of schedulers (N) | ```-native_log_quorum_size``` setting (```floor(N/2) + 1```)
+  ------------------------ | -------------------------------------------------------------
+  1                        | 1
+  3                        | 2
+  5                        | 3
+  7                        | 4
+
+*Incorrectly setting this flag will cause data corruption to occur!*
+
+### `-native_log_file_path`
+Location of the Mesos replicated log files. For optimal and consistent performance, consider
+allocating a dedicated disk (preferably SSD) for the replicated log. Ensure that this disk is not
+used by anything else (e.g. no process logging) and in particular that it is a real disk
+and not just a partition.
+
+Even when a dedicated disk is used, switching the Linux kernel I/O scheduler from `CFQ` to `deadline`
+can further help with storage performance in Aurora ([see this ticket for details](https://issues.apache.org/jira/browse/AURORA-1211)).
+
+### `-native_log_zk_group_path`
+ZooKeeper path used for Mesos replicated log quorum discovery.
+
+See [code](https://github.com/apache/aurora/blob/rel/0.21.0/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java) for
+other available Mesos replicated log configuration options and default values.
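+
+As an illustration only, below is a minimal sketch of the replicated log flags for a hypothetical
+3-scheduler cluster; the file path and ZooKeeper group shown here are placeholders and must match
+your own deployment:
+
+    # Hypothetical 3-scheduler cluster: quorum = floor(3/2) + 1 = 2
+    -native_log_quorum_size=2
+    # Placeholder path on a dedicated disk
+    -native_log_file_path=/var/lib/aurora/scheduler/db
+    # Placeholder ZooKeeper group used for replica discovery
+    -native_log_zk_group_path=/aurora/replicated-log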
+
+### Changing the Quorum Size
+Special care needs to be taken when changing the size of the Aurora scheduler quorum.
+Since Aurora uses a Mesos replicated log, steps similar to those for
+[changing the Mesos quorum size](http://mesos.apache.org/documentation/latest/operational-guide) need to be followed.
+
+In preparation, increase `-native_log_quorum_size` on each existing scheduler and restart them.
+When updating from 3 to 5 schedulers, the quorum size would grow from 2 to 3.
+
+When starting the new schedulers, set `-native_log_quorum_size` to the new value. Failing to
+first increase the quorum size on running schedulers can in some cases result in corruption
+or truncation of the replicated log used by Aurora. In that case, see the documentation on
+[recovering from backup](../backup-restore/).
+
+
+## Backup Configuration
+
+Configuration options for the Aurora scheduler backup manager.
+
+* `-backup_interval`: The interval on which the scheduler writes local storage backups.
+  The default is every hour.
+* `-backup_dir`: Directory to write backups to. As stated above, this should not be co-located on the
+  same disk as the replicated log.
+* `-max_saved_backups`: Maximum number of backups to retain before deleting the oldest backup(s).
+
+
+## Resource Isolation
+
+For proper CPU, memory, and disk isolation as mentioned in our [end-user documentation](../../features/resource-isolation/),
+we recommend adding the following isolators to the `--isolation` flag of the Mesos agent:
+
+* `cgroups/cpu`
+* `cgroups/mem`
+* `disk/du`
+
+In addition, we recommend setting the following [agent flags](http://mesos.apache.org/documentation/latest/configuration/):
+
+* `--cgroups_limit_swap` to enable memory limits on both memory and swap instead of just memory.
+  Alternatively, you could disable swap on your agent hosts.
+* `--cgroups_enable_cfs` to enable hard limits on CPU resources via the CFS bandwidth limiting
+  feature.
+* `--enforce_container_disk_quota` to enable disk quota enforcement for containers.
+
+To enable the optional GPU support in Mesos, please see the GPU-related flags in the
+[Mesos configuration](http://mesos.apache.org/documentation/latest/configuration/).
+To enable the corresponding feature in Aurora, you have to start the scheduler with the flag:
+
+    -allow_gpu_resource=true
+
+If you want to use revocable resources, first follow the
+[Mesos oversubscription documentation](http://mesos.apache.org/documentation/latest/oversubscription/)
+and then set this Aurora scheduler flag to allow receiving revocable Mesos offers:
+
+    -receive_revocable_resources=true
+
+Both CPUs and RAM are supported as revocable resources. The former is enabled by default;
+the latter needs to be enabled via:
+
+    -enable_revocable_ram=true
+
+Unless you want to use the [default](https://github.com/apache/aurora/blob/rel/0.21.0/src/main/resources/org/apache/aurora/scheduler/tiers.json)
+tier configuration, you will also have to specify a file path:
+
+    -tier_config=path/to/tiers/config.json
+
+
+## Multi-Framework Setup
+
+Aurora holds onto Mesos offers in order to provide efficient scheduling and
+[preemption](../../features/multitenancy/#preemption). This is problematic in multi-framework
+environments as Aurora might starve other frameworks.
+
+At the cost of increased scheduling latency, Aurora can be configured to be more cooperative:
+
+* Lowering `-min_offer_hold_time` (e.g. to `1mins`) can ensure unused offers are returned to
+  Mesos more frequently.
+* Increasing `-offer_filter_duration` (e.g. to `30secs`) will instruct Mesos
+  not to re-offer rejected resources for the given duration.
+
+Setting a [minimum amount of resources](http://mesos.apache.org/documentation/latest/quota/) for
+each Mesos role can further help ensure no framework is starved entirely.
+
+
+## Containers
+
+Both the Mesos and Docker containerizers require configuration of the Mesos agent.
+
+### Mesos Containerizer
+
+The minimal agent configuration requires enabling Docker and Appc image support for the Mesos
+containerizer:
+
+    --containerizers=mesos
+    --image_providers=appc,docker
+    --isolation=filesystem/linux,docker/runtime  # as an addition to your other isolators
+
+Further details can be found in the corresponding [Mesos documentation](http://mesos.apache.org/documentation/latest/container-image/).
+
+### Docker Containerizer
+
+The [Docker containerizer](http://mesos.apache.org/documentation/latest/docker-containerizer/)
+requires that the Docker engine be installed on each agent host. In addition, it must be enabled on the
+Mesos agents by launching them with the option:
+
+    --containerizers=mesos,docker
+
+If you would like to run a container with a read-only filesystem, it may also be necessary to use
+the scheduler flag `-thermos_home_in_sandbox` in order to set HOME to the sandbox
+before the executor runs. This will make sure that the executor/runner PEX extraction happens
+inside the sandbox instead of the container filesystem root.
+
+If you would like to supply your own parameters to `docker run` when launching jobs in Docker
+containers, you may use the following flags:
+
+    -allow_docker_parameters
+    -default_docker_parameters
+
+`-allow_docker_parameters` controls whether or not users may pass their own configuration parameters
+through the job configuration files. If set to `false` (the default), the scheduler will reject
+jobs with custom parameters. *NOTE*: this setting should be used with caution as it allows any job
+owner to specify any parameters they wish, including those that may introduce security concerns
+(`privileged=true`, for example).
+
+`-default_docker_parameters` allows a cluster operator to specify a universal set of parameters that
+should be used for every container that does not have parameters explicitly configured at the job
+level. The argument accepts a multimap format:
+
+    -default_docker_parameters="read-only=true,tmpfs=/tmp,tmpfs=/run"
+
+### Common Options
+
+The following Aurora options work for both containerizers.
+
+The scheduler flag `-global_container_mounts` allows mounting paths from the host (i.e. the agent machine)
+into all containers on that host. The format is a comma-separated list of `host_path:container_path[:mode]`
+tuples. For example `-global_container_mounts=/opt/secret_keys_dir:/mnt/secret_keys_dir:ro` mounts
+`/opt/secret_keys_dir` from the agents into all launched containers. Valid modes are `ro` and `rw`.
+
+
+## Thermos Process Logs
+
+### Log destination
+By default, Thermos will write process stdout/stderr to log files in the sandbox. Process object
+configuration allows specifying alternate log file destinations like streamed stdout/stderr or
+suppression of all log output. Default behavior can be configured for the entire cluster with the
+following flag (through the `-thermos_executor_flags` argument to the Aurora scheduler):
+
+    --runner-logger-destination=both
+
+The `both` configuration will send logs to files and also stream them to the parent stdout/stderr outputs.
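+
+For example, a cluster-wide default could be set when starting the scheduler; this is only a
+sketch, and any other executor flags you already pass must be kept in the same argument:
+
+    # Hypothetical scheduler argument forwarding the runner flag to every executor
+    -thermos_executor_flags="--runner-logger-destination=both"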
+
+See [Configuration Reference](../../reference/configuration/#logger) for all destination options.
+
+### Log rotation
+By default, Thermos will not rotate the stdout/stderr logs from child processes and they will grow
+without bound. An individual user may change this behavior via configuration on the Process object,
+but it may also be desirable to change the default configuration for the entire cluster.
+In order to enable rotation by default, the following flags can be applied to Thermos (through the
+`-thermos_executor_flags` argument to the Aurora scheduler):
+
+    --runner-logger-mode=rotate
+    --runner-rotate-log-size-mb=100
+    --runner-rotate-log-backups=10
+
+In the above example, each instance of the Thermos runner will rotate stderr/stdout logs once they
+reach 100 MiB in size and keep a maximum of 10 backups. If a user has provided a custom setting for
+their process, it will override these default settings.
+
+
+## Thermos Executor Wrapper
+
+If you need to do computation before starting the Thermos executor (for example, setting a different
+`--announcer-hostname` parameter for every executor), then the Thermos executor should be invoked
+inside a wrapper script. In such a case, the Aurora scheduler should be started with
+`-thermos_executor_path` pointing to the wrapper script and `-thermos_executor_resources` set to a
+comma-separated string of all the resources that should be copied into the sandbox (including the
+original Thermos executor). Ensure the wrapper script does not access resources outside of the
+sandbox, as when the script is run from within a Docker container those resources may not exist.
+
+For example, to wrap the executor inside a simple wrapper, the scheduler would be started like this:
+`-thermos_executor_path=/path/to/wrapper.sh -thermos_executor_resources=/usr/share/aurora/bin/thermos_executor.pex`
+
+## Custom Executors
+
+The scheduler can be configured to utilize a custom executor by specifying the `-custom_executor_config` flag.
+The flag must be set to the path of a valid executor configuration file.
+
+For more information on this feature, please see the custom executors [documentation](../../features/custom-executors/).
+
+## A note on increasing executor overhead
+
+Increasing executor overhead on an existing cluster, whether it be for custom executors or for Thermos,
+will result in degraded preemption performance until all tasks that began life with the previous,
+lower-overhead executor configuration are preempted/restarted.
+
+## Controlling MTTA via Update Affinity
+
+When there is high resource contention in your cluster, you may experience noticeably elevated job update
+times, as well as high task churn across the cluster. This is due to Aurora's first-fit scheduling
+algorithm. To alleviate this, you can enable update affinity, where the Scheduler will make a best-effort
+attempt to reuse the same agent for the updated task (so long as the resources for the job are not being
+increased).
+
+To enable this in the Scheduler, you can set the following options:
+
+    -enable_update_affinity=true
+    -update_affinity_reservation_hold_time=3mins
+
+You will need to tune the hold time to match the behavior you see in your cluster. If you have extremely
+high update throughput, you might have to extend it as processing updates could easily add significant
+delays between scheduling attempts. You may also have to tune scheduling parameters to achieve the
+throughput you need in your cluster.
+Some relevant settings (with defaults) are:
+
+    -max_schedule_attempts_per_sec=40
+    -initial_schedule_penalty=1secs
+    -max_schedule_penalty=1mins
+    -scheduling_max_batch_size=3
+    -max_tasks_per_schedule_attempt=5
+
+There are metrics exposed by the Scheduler that can provide guidance on where the bottleneck is.
+Example metrics to look at:
+
+ - `schedule_attempts_blocks` (if this number is greater than 0, then task throughput is hitting
+   limits controlled by `-max_schedule_attempts_per_sec`)
+ - `scheduled_task_penalty_*` (metrics around scheduling penalties for tasks; if the numbers here are high,
+   then you could have high contention for resources)
+
+Most likely you'll hit the limit on the number of update instances that can be processed per minute
+before you run into any other limits. So if your total work done per minute starts to exceed 2k instances,
+you may need to extend `-update_affinity_reservation_hold_time`.
+
+## Cluster Maintenance
+
+Aurora performs maintenance-related task drains. How often the scheduler polls for maintenance work
+can be controlled via:
+
+    -host_maintenance_polling_interval=1min
+
+## Enforcing SLA limitations
+
+Since tasks can specify their own `SLAPolicy`, the cluster needs to limit these SLA requirements.
+Too aggressive a requirement can permanently block any type of maintenance work
+(e.g. OS/kernel/security upgrades) on a host and hold it hostage.
+
+An operator can control the limits for SLA requirements via these scheduler configuration options:
+
+    -max_sla_duration_secs=2hrs
+    -min_required_instances_for_sla_check=20
+
+_Note: These limits only apply to `CountSlaPolicy` and `PercentageSlaPolicy`._
+
+### Limiting Coordinator SLA
+
+With `CoordinatorSlaPolicy`, the SLA calculation is off-loaded to an external HTTP service. Some
+relevant scheduler configuration options are:
+
+    -sla_coordinator_timeout=1min
+    -max_parallel_coordinated_maintenance=10
+
+Handing off the SLA calculation to an external service can potentially block maintenance
+on hosts for an indefinite amount of time, either due to a mis-configured coordinator or due to
+a genuinely degraded service. In those situations, the following metrics will be helpful to identify the
+offending tasks:
+
+    sla_coordinator_user_errors_* (counter tracking the number of times the coordinator for the task
+                                   returned a bad response)
+    sla_coordinator_errors_* (counter tracking the number of times the scheduler was not able
+                              to communicate with the coordinator of the task)
+    sla_coordinator_lock_starvation_* (counter tracking the number of times the scheduler was not able to
+                                       get the lock for the coordinator of the task)
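+
+One quick way to surface these counters is to query the scheduler's `/vars` endpoint; a minimal
+sketch, assuming the example scheduler address `192.168.33.7:8081` used earlier in this document:
+
+    # Hypothetical check against the leading scheduler's /vars endpoint
+    curl -s http://192.168.33.7:8081/vars | grep sla_coordinator_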
