Modified: aurora/site/source/community.html.md.erb
URL: 
http://svn.apache.org/viewvc/aurora/site/source/community.html.md.erb?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/community.html.md.erb (original)
+++ aurora/site/source/community.html.md.erb Tue Sep 11 05:25:44 2018
@@ -4,8 +4,10 @@
   <div class="col-md-4">
     <h3>Contributing</h3>
     <h4 name="reportbugs">Report or track a bug</h4>
-    <p>Bugs can be reported on our <a 
href="http://issues.apache.org/jira/browse/AURORA";>JIRA</a>.
-       In order to create a new issue, you'll need register for an account.</p>
+    <p>Bugs can be reported on our <a 
href="http://issues.apache.org/jira/browse/AURORA";>JIRA</a>
+       or raised as an issue on our <a 
href="https://github.com/apache/aurora/issues";>GitHub</a> repository.</p>
+    <p>In order to create a new issue on JIRA, you'll need to register for an 
account. A GitHub
+       account is required to raise issues on our repository.</p>
 
     <h4 name="contribute">Submit a patch</h4>
     <p>Please read our <a 
href="/documentation/latest/contributing/">contribution guide</a>
@@ -18,9 +20,6 @@
        <code>mesos.slack.com</code></a>.</p>
     <p>To request an invite for slack please click <a 
href="https://mesos-slackin.herokuapp.com/";>here</a>.</p>
     <p>All slack communication is publicly archived <a 
href="http://mesos.slackarchive.io/aurora/";>here</a>.</p>
-    <h4 name="ircchannel">IRC</h4>
-    <p>There is also a two way mirror between Slack and IRC via the #aurora 
channel on <code>irc.freenode.net</code>.
-       If you're new to IRC, we suggest trying a <a 
href="http://webchat.freenode.net/?channels=#aurora";>web-based client</a>.</p>
   </div>
   <div class="col-md-4">
     <h3>Mailing lists</h3>

Modified: aurora/site/source/documentation/latest/additional-resources/tools.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/additional-resources/tools.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/additional-resources/tools.md 
(original)
+++ aurora/site/source/documentation/latest/additional-resources/tools.md Tue 
Sep 11 05:25:44 2018
@@ -21,4 +21,4 @@ Various tools integrate with Aurora. Is
   - [aurora-packaging](https://github.com/apache/aurora-packaging), the source 
of the official Aurora packages
 
 * Thrift Clients:
-  - [gorealis](https://github.com/rdelval/gorealis) for communicating with the 
scheduler using Go
+  - [gorealis](https://github.com/paypal/gorealis) for communicating with the 
scheduler using Go

Modified: aurora/site/source/documentation/latest/contributing.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/contributing.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/contributing.md (original)
+++ aurora/site/source/documentation/latest/contributing.md Tue Sep 11 05:25:44 
2018
@@ -2,7 +2,7 @@
 
 First things first, you'll need the source! The Aurora source is available 
from Apache git:
 
-    git clone https://git-wip-us.apache.org/repos/asf/aurora
+    git clone https://gitbox.apache.org/repos/asf/aurora
 
 Read the Style Guides
 ---------------------
@@ -36,8 +36,8 @@ Post a review with `rbt`, fill out the f
 
     ./rbt post -o
 
-If you're unsure about who to add as a reviewer, you can default to adding 
Zameer Manji (zmanji) and
-Joshua Cohen (jcohen). They will take care of finding an appropriate reviewer 
for the patch.
+If you're unsure about who to add as a reviewer, you can default to adding 
Stephan Erb (StephanErb) and
+Renan DelValle (rdelvalle). They will take care of finding an appropriate 
reviewer for the patch.
 
 Once you've done this, you probably want to mark the associated Jira issue as 
Reviewable.
 

Modified: 
aurora/site/source/documentation/latest/development/committers-guide.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/committers-guide.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/committers-guide.md 
(original)
+++ aurora/site/source/documentation/latest/development/committers-guide.md Tue 
Sep 11 05:25:44 2018
@@ -29,7 +29,7 @@ and that key will need to be added to ou
 
 2. Add your gpg key to the Apache Aurora KEYS file:
 
-               git clone https://git-wip-us.apache.org/repos/asf/aurora.git
+               git clone https://gitbox.apache.org/repos/asf/aurora
                (gpg --list-sigs <KEY ID> && gpg --armor --export <KEY ID>) >> 
KEYS
                git add KEYS && git commit -m "Adding gpg key for <APACHE ID>"
                ./rbt post -o -g

Modified: aurora/site/source/documentation/latest/development/db-migration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/db-migration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/db-migration.md 
(original)
+++ aurora/site/source/documentation/latest/development/db-migration.md Tue Sep 
11 05:25:44 2018
@@ -14,7 +14,7 @@ When adding or altering tables or changi
 
[schema.sql](../../src/main/resources/org/apache/aurora/scheduler/storage/db/schema.sql),
 a new
 migration class should be created under the 
org.apache.aurora.scheduler.storage.db.migration
 package. The class should implement the 
[MigrationScript](https://github.com/mybatis/migrations/blob/master/src/main/java/org/apache/ibatis/migration/MigrationScript.java)
-interface (see 
[V001_TestMigration](https://github.com/apache/aurora/blob/rel/0.20.0/src/test/java/org/apache/aurora/scheduler/storage/db/testmigration/V001_TestMigration.java)
+interface (see 
[V001_TestMigration](https://github.com/apache/aurora/blob/master/src/test/java/org/apache/aurora/scheduler/storage/db/testmigration/V001_TestMigration.java)
 as an example). The upgrade and downgrade scripts are defined in this class. 
When restoring a
 snapshot the list of migrations on the classpath is compared to the list of 
applied changes in the
 DB. Any changes that have not yet been applied are executed and their 
downgrade script is stored

Modified: aurora/site/source/documentation/latest/development/thrift.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/thrift.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/thrift.md (original)
+++ aurora/site/source/documentation/latest/development/thrift.md Tue Sep 11 
05:25:44 2018
@@ -6,7 +6,7 @@ client/server RPC protocol as well as fo
 correctly handling additions and renames of the existing members, field 
removals must be done
 carefully to ensure backwards compatibility and provide predictable 
deprecation cycle. This
 document describes general guidelines for making Thrift schema changes to the 
existing fields in
-[api.thrift](https://github.com/apache/aurora/blob/rel/0.20.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+[api.thrift](https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
 
 It is highly recommended to go through the
 [Thrift: The Missing 
Guide](http://diwakergupta.github.io/thrift-missing-guide/) first to refresh on
@@ -33,7 +33,7 @@ communicate with scheduler/client from v
 * Add a new field as an eventual replacement of the old one and implement a 
dual read/write
 anywhere the old field is used. If a thrift struct is mapped in the DB store 
make sure both columns
 are marked as `NOT NULL`
-* Check 
[storage.thrift](https://github.com/apache/aurora/blob/rel/0.20.0/api/src/main/thrift/org/apache/aurora/gen/storage.thrift)
 to see if
+* Check 
[storage.thrift](https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/storage.thrift)
 to see if
 the affected struct is stored in Aurora scheduler storage. If so, it's almost 
certainly also
 necessary to perform a [DB migration](../db-migration/).
 * Add a deprecation jira ticket into the vCurrent+1 release candidate

Modified: aurora/site/source/documentation/latest/features/job-updates.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/features/job-updates.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/features/job-updates.md (original)
+++ aurora/site/source/documentation/latest/features/job-updates.md Tue Sep 11 
05:25:44 2018
@@ -70,7 +70,7 @@ acknowledging ("heartbeating") job updat
 service updates where explicit job health monitoring is vital during the 
entire job update
 lifecycle. Such job updates would rely on an external service (or a custom 
client) periodically
 pulsing an active coordinated job update via a
-[pulseJobUpdate 
RPC](https://github.com/apache/aurora/blob/rel/0.20.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+[pulseJobUpdate 
RPC](https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
 
 A coordinated update is defined by setting a positive
 [pulse_interval_secs](../../reference/configuration/#updateconfig-objects) 
value in job configuration
@@ -84,6 +84,19 @@ progress until the first pulse arrives.
 provided the pulse interval has not expired.
 
 
+SLA-Aware Updates
+-----------------
+
+Updates can take advantage of [Custom SLA 
Requirements](../../features/sla-requirements/) and
+specify the `sla_aware=True` option within
+[UpdateConfig](../../reference/configuration/#updateconfig-objects) to only 
update instances if
+the action will maintain the task's SLA requirements. This feature allows 
updates to avoid killing
+too many instances in the face of unexpected failures outside of the update 
range.
+
+See the [Using the `sla_aware` 
option](../../reference/configuration/#using-the-sla-aware-option) section
+for more information on how to use this feature.
+
+
 Canary Deployments
 ------------------
 

Modified: aurora/site/source/documentation/latest/features/sla-metrics.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/features/sla-metrics.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/features/sla-metrics.md (original)
+++ aurora/site/source/documentation/latest/features/sla-metrics.md Tue Sep 11 
05:25:44 2018
@@ -63,7 +63,7 @@ relevant to uptime calculations. By appl
 transition records, we can build a deterministic downtime trace for every 
given service instance.
 
 A task going through a state transition carries one of three possible SLA 
meanings
-(see 
[SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java)
 for
+(see 
[SlaAlgorithm.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java)
 for
 sla-to-task-state mapping):
 
 * Task is UP: starts a period where the task is considered to be up and 
running from the Aurora
@@ -110,7 +110,7 @@ metric that helps track the dependency o
 * Per job - `sla_<job_key>_mtta_ms`
 * Per cluster - `sla_cluster_mtta_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are 
defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtta_ms`
     * `sla_cpu_medium_mtta_ms`
@@ -147,7 +147,7 @@ for a task.*
 * Per job - `sla_<job_key>_mtts_ms`
 * Per cluster - `sla_cluster_mtts_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are 
defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtts_ms`
     * `sla_cpu_medium_mtts_ms`
@@ -182,7 +182,7 @@ reflecting on the overall time it takes
 * Per job - `sla_<job_key>_mttr_ms`
 * Per cluster - `sla_cluster_mttr_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are 
defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mttr_ms`
     * `sla_cpu_medium_mttr_ms`

Modified: aurora/site/source/documentation/latest/index.html.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/index.html.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/index.html.md (original)
+++ aurora/site/source/documentation/latest/index.html.md Tue Sep 11 05:25:44 
2018
@@ -28,6 +28,7 @@ Description of important Aurora features
  * [Services](features/services/)
  * [Service Discovery](features/service-discovery/)
  * [SLA Metrics](features/sla-metrics/)
+ * [SLA Requirements](features/sla-requirements/)
  * [Webhooks](features/webhooks/)
 
 ## Operators

Modified: aurora/site/source/documentation/latest/operations/backup-restore.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/backup-restore.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/operations/backup-restore.md 
(original)
+++ aurora/site/source/documentation/latest/operations/backup-restore.md Tue 
Sep 11 05:25:44 2018
@@ -18,74 +18,63 @@ so any tasks that have been rescheduled
 Instructions below have been verified in [Vagrant 
environment](../../getting-started/vagrant/) and with minor
 syntax/path changes should be applicable to any Aurora cluster.
 
-## Preparation
-
 Follow these steps to prepare the cluster for restoring from a backup:
 
-* Stop all scheduler instances
+## Preparation
 
-* Consider blocking external traffic on a port defined in `-http_port` for all 
schedulers to
-prevent users from interacting with the scheduler during the restoration 
process. This will help
-troubleshooting by reducing the scheduler log noise and prevent users from 
making changes that will
-be erased after the backup snapshot is restored.
-
-* Configure `aurora_admin` access to run all commands listed in
-  [Restore from backup](#restore-from-backup) section locally on the leading 
scheduler:
-  * Make sure the 
[clusters.json](../../reference/client-cluster-configuration/) file configured 
to
-    access scheduler directly. Set `scheduler_uri` setting and remove `zk`. 
Since leader can get
-    re-elected during the restore steps, consider doing it on all scheduler 
replicas.
-  * Depending on your particular security approach you will need to either 
turn off scheduler
-    authorization by removing scheduler `-http_authentication_mechanism` flag 
or make sure the
-    direct scheduler access is properly authorized. E.g.: in case of Kerberos 
you will need to make
-    a `/etc/hosts` file change to match your local IP to the scheduler URL 
configured in keytabs:
-
-        <local_ip> <scheduler_domain_in_keytabs>
-
-* Next steps are required to put scheduler into a partially disabled state 
where it would still be
-able to accept storage recovery requests but unable to schedule or change task 
states. This may be
-accomplished by updating the following scheduler configuration options:
-  * Set `-mesos_master_address` to a non-existent zk address. This will 
prevent scheduler from
-    registering with Mesos. E.g.: 
`-mesos_master_address=zk://localhost:1111/mesos/master`
-  * `-max_registration_delay` - set to sufficiently long interval to prevent 
registration timeout
-    and as a result scheduler suicide. E.g: `-max_registration_delay=360mins`
-  * Make sure `-reconciliation_initial_delay` option is set high enough (e.g.: 
`365days`) to
-    prevent accidental task GC. This is important as scheduler will attempt to 
reconcile the cluster
-    state and will kill all tasks when restarted with an empty Mesos 
replicated log.
-
-* Restart all schedulers
-
-## Cleanup and re-initialize Mesos replicated log
-
-Get rid of the corrupted files and re-initialize Mesos replicated log:
-
-* Stop schedulers
-* Delete all files under `-native_log_file_path` on all schedulers
-* Initialize Mesos replica's log file: `sudo mesos-log initialize 
--path=<-native_log_file_path>`
-* Start schedulers
+* Stop all scheduler instances.
 
-## Restore from backup
+* Pick a backup to use for rehydrating the mesos-replicated log. Backups can 
be found in the
+directory given to the scheduler as the `-backup_dir` argument. Backups are 
stored in the format
+`scheduler-backup-<yyyy-MM-dd-HH-mm>`.
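+
+Because the `yyyy-MM-dd-HH-mm` timestamp sorts lexicographically in 
chronological order, the
+newest backup can be picked with a plain sort. A minimal sketch (the helper 
name and directory
+layout are assumptions, not part of Aurora):
+
```shell
# latest_backup DIR — print the name of the newest scheduler backup in DIR.
# Relies on scheduler-backup-<yyyy-MM-dd-HH-mm> names sorting chronologically.
latest_backup() {
  ls "$1" | grep '^scheduler-backup-' | sort | tail -n 1
}
```
+
+For example, `latest_backup /var/lib/aurora/backups` (path assumed; use your 
`-backup_dir` value).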
 
-At this point the scheduler is ready to rehydrate from the backup:
+* If running the Aurora Scheduler in HA mode, pick a single scheduler instance 
to rehydrate.
 
-* Identify the leading scheduler by:
-  * examining the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at 
the scheduler
-    `/vars` endpoint. Leader will have 1. All other replicas - 0.
-  * examining scheduler logs
-  * or examining Zookeeper registration under the path defined by 
`-zk_endpoints`
-    and `-serverset_path`
-
-* Locate the desired backup file, copy it to the leading scheduler's 
`-backup_dir` folder and stage
-recovery by running the following command on a leader
-`aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> 
scheduler-backup-<yyyy-MM-dd-HH-mm>`
-
-* At this point, the recovery snapshot is staged and available for manual 
verification/modification
-via `aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect` and
-`scheduler_delete_recovery_tasks --bypass-leader-redirect` commands.
-See `aurora_admin help <command>` for usage details.
-
-* Commit recovery. This instructs the scheduler to overwrite the existing 
Mesos replicated log with
-the provided backup snapshot and initiate a mandatory failover
-`aurora_admin scheduler_commit_recovery --bypass-leader-redirect  <cluster>`
+* Locate the `recovery-tool` in your setup. If Aurora was installed using a 
Debian package
+generated by our `aurora-packaging` script, the recovery tool can be found
+in `/usr/share/aurora/bin/recovery-tool`.
 
 ## Cleanup
-Undo any modification done during [Preparation](#preparation) sequence.
+
+* Delete (or move) the Mesos replicated log path for each scheduler instance. 
The location of the
+Mesos replicated log file path can be found by looking at the value given to 
the flag
+`-native_log_file_path` for each instance.
+
+* Initialize the Mesos replicated log files using the mesos-log tool:
+```
+sudo -u <USER> mesos-log initialize --path=<native_log_file_path>
+```
+Where `USER` is the user under which the scheduler instance will be run. For 
installations using
+Debian packages, the default user will be `aurora`. You may alternatively 
specify
+a group as well by passing the `-g <GROUP>` option to `sudo`.
+Note that if the user under which the Aurora scheduler instance is run _does 
not_ have permissions
+to read this directory and the files it contains, the instance will fail to 
start.
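+
+Since a scheduler whose user cannot read this path will fail to start, it can 
be worth
+pre-checking permissions before restarting. A minimal sketch (the helper name 
is an assumption;
+run it as the scheduler user to mirror its permissions):
+
```shell
# check_log_path PATH — succeed only if PATH exists and is readable by the
# current user (run as the scheduler user, e.g. via sudo -u <USER>).
check_log_path() {
  [ -e "$1" ] && [ -r "$1" ]
}
```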
+
+## Restore from backup
+
+* Run the `recovery-tool`. For flags it shares with the scheduler instance,
+use the same values:
+```
+$ recovery-tool -from BACKUP \
+-to LOG \
+-backup=<selected_backup_location> \
+-native_log_zk_group_path=<native_log_zk_group_path> \
+-native_log_file_path=<native_log_file_path> \
+-zk_endpoints=<zk_endpoints>
+```
+
+## Bring scheduler instances back online
+
+### If running in HA Mode
+
+* Start the rehydrated scheduler instance along with enough cleaned up 
instances to
+meet the `-native_log_quorum_size`. The mesos-replicated log algorithm will 
replenish
+the "blank" scheduler instances with the information from the rehydrated 
instance.
+
+* Start any remaining scheduler instances.
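+
+The number of instances needed to satisfy `-native_log_quorum_size` follows 
the standard
+majority rule; a small sketch of the calculation (the helper name is an 
assumption):
+
```shell
# quorum_size N — majority quorum for N scheduler replicas: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}
```
+
+For example, a 5-scheduler cluster needs the rehydrated instance plus 2 clean 
ones to reach its
+quorum of 3.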
+
+### If running in singleton mode
+
+* Start the single scheduler instance.
+
+

Modified: aurora/site/source/documentation/latest/operations/configuration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/configuration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/operations/configuration.md 
(original)
+++ aurora/site/source/documentation/latest/operations/configuration.md Tue Sep 
11 05:25:44 2018
@@ -104,7 +104,7 @@ can furthermore help with storage perfor
 ### `-native_log_zk_group_path`
 ZooKeeper path used for Mesos replicated log quorum discovery.
 
-See 
[code](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java)
 for
+See 
[code](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java)
 for
 other available Mesos replicated log configuration options and default values.
 
 ### Changing the Quorum Size
@@ -167,7 +167,7 @@ the latter needs to be enabled via:
 
     -enable_revocable_ram=true
 
-Unless you want to use the 
[default](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/resources/org/apache/aurora/scheduler/tiers.json)
+Unless you want to use the 
[default](https://github.com/apache/aurora/blob/master/src/main/resources/org/apache/aurora/scheduler/tiers.json)
 tier configuration, you will also have to specify a file path:
 
     -tier_config=path/to/tiers/config.json
@@ -312,19 +312,19 @@ increased).
 
 To enable this in the Scheduler, you can set the following options:
 
-    --enable_update_affinity=true
-    --update_affinity_reservation_hold_time=3mins
+    -enable_update_affinity=true
+    -update_affinity_reservation_hold_time=3mins
 
 You will need to tune the hold time to match the behavior you see in your 
cluster. If you have extremely
 high update throughput, you might have to extend it as processing updates 
could easily add significant
 delays between scheduling attempts. You may also have to tune scheduling 
parameters to achieve the
 throughput you need in your cluster. Some relevant settings (with defaults) 
are:
 
-    --max_schedule_attempts_per_sec=40
-    --initial_schedule_penalty=1secs
-    --max_schedule_penalty=1mins
-    --scheduling_max_batch_size=3
-    --max_tasks_per_schedule_attempt=5
+    -max_schedule_attempts_per_sec=40
+    -initial_schedule_penalty=1secs
+    -max_schedule_penalty=1mins
+    -scheduling_max_batch_size=3
+    -max_tasks_per_schedule_attempt=5
 
 There are metrics exposed by the Scheduler which can provide guidance on where 
the bottleneck is.
 Example metrics to look at:
@@ -337,3 +337,44 @@ Example metrics to look at:
 Most likely you'll run into limits with the number of update instances that 
can be processed per minute
 before you run into any other limits. So if your total work done per minute 
starts to exceed 2k instances,
 you may need to extend the update_affinity_reservation_hold_time.
+
+## Cluster Maintenance
+
+Aurora performs maintenance-related task drains. How often the scheduler polls 
for maintenance work
+can be controlled via:
+
+    -host_maintenance_polling_interval=1min
+
+## Enforcing SLA limitations
+
+Since tasks can specify their own `SLAPolicy`, the cluster needs to limit 
these SLA requirements.
+Too aggressive a requirement can permanently block any type of maintenance work
+(e.g. OS/kernel/security upgrades) on a host and hold it hostage.
+
+An operator can control the limits for SLA requirements via these scheduler 
configuration options:
+
+    -max_sla_duration_secs=2hrs
+    -min_required_instances_for_sla_check=20
+
+_Note: These limits only apply to `CountSlaPolicy` and `PercentageSlaPolicy`._
+
+### Limiting Coordinator SLA
+
+With `CoordinatorSlaPolicy` the SLA calculation is off-loaded to an external 
HTTP service. Some
+relevant scheduler configuration options are:
+
+    -sla_coordinator_timeout=1min
+    -max_parallel_coordinated_maintenance=10
+
+Handing off the SLA calculation to an external service can potentially block 
maintenance
+on hosts for an indefinite amount of time (either due to a misconfigured 
coordinator or to
+a legitimately degraded service). In those situations, the following metrics 
will help identify
+the offending tasks.
+
+    sla_coordinator_user_errors_*     (counter tracking number of times the 
coordinator for the task
+                                       returned a bad response.)
+    sla_coordinator_errors_*          (counter tracking number of times the 
scheduler was not able
+                                       to communicate with the coordinator of 
the task.)
+    sla_coordinator_lock_starvation_* (counter tracking number of times the 
scheduler was not able to
+                                       get the lock for the coordinator of the 
task.)
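+
+These counters are exported on the scheduler `/vars` endpoint, so a quick way 
to spot the
+offending tasks is to filter them out of that output. A sketch (host and port 
are assumptions):
+
```shell
# sla_coordinator_vars — filter SLA coordinator counters from /vars output
# read on stdin, e.g.: curl -s http://<scheduler>:8081/vars | sla_coordinator_vars
sla_coordinator_vars() {
  grep '^sla_coordinator_'
}
```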
+

Modified: aurora/site/source/documentation/latest/reference/configuration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/reference/configuration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/reference/configuration.md 
(original)
+++ aurora/site/source/documentation/latest/reference/configuration.md Tue Sep 
11 05:25:44 2018
@@ -23,6 +23,7 @@ configuration design.
     - [Announcer Objects](#announcer-objects)
     - [Container Objects](#container)
     - [LifecycleConfig Objects](#lifecycleconfig-objects)
+    - [SlaPolicy Objects](#slapolicy-objects)
 - [Specifying Scheduling Constraints](#specifying-scheduling-constraints)
 - [Template Namespaces](#template-namespaces)
     - [mesos Namespace](#mesos-namespace)
@@ -343,7 +344,7 @@ Job Schema
   ```contact``` | String | Best email address to reach the owner of the job. 
For production jobs, this is usually a team mailing list.
   ```instances```| Integer | Number of instances (sometimes referred to as 
replicas or shards) of the task to create. (Default: 1)
   ```cron_schedule``` | String | Cron schedule in cron format. May only be 
used with non-service jobs. See [Cron Jobs](../../features/cron-jobs/) for more 
information. Default: None (not a cron job.)
-  ```cron_collision_policy``` | String | Policy to use when a cron job is 
triggered while a previous run is still active. KILL_EXISTING Kill the previous 
run, and schedule the new run CANCEL_NEW Let the previous run continue, and 
cancel the new run. (Default: KILL_EXISTING)
+  ```cron_collision_policy``` | String | Policy to use when a cron job is 
triggered while a previous run is still active. KILL\_EXISTING Kill the 
previous run, and schedule the new run CANCEL\_NEW Let the previous run 
continue, and cancel the new run. (Default: KILL_EXISTING)
   ```update_config``` | ```UpdateConfig``` object | Parameters for controlling 
the rate and policy of rolling updates.
   ```constraints``` | dict | Scheduling constraints for the tasks. See the 
section on the [constraint specification 
language](#specifying-scheduling-constraints)
   ```service``` | Boolean | If True, restart tasks regardless of success or 
failure. (Default: False)
@@ -359,6 +360,7 @@ Job Schema
   ```partition_policy``` | ```PartitionPolicy``` object | An optional 
partition policy that allows job owners to define how to handle partitions for 
running tasks (in partition-aware Aurora clusters)
   ```metadata``` | list of ```Metadata``` objects | list of ```Metadata``` 
objects for user's customized metadata information.
   ```executor_config``` | ```ExecutorConfig``` object | Allows choosing an 
alternative executor defined in `custom_executor_config` to be used instead of 
Thermos. Tasks will be launched with Thermos as the executor by default. See 
[Custom Executors](../../features/custom-executors/) for more info.
+  ```sla_policy``` |  Choice of ```CountSlaPolicy```, 
```PercentageSlaPolicy``` or ```CoordinatorSlaPolicy``` object | An optional 
SLA policy that allows job owners to describe the SLA requirements for the job. 
See [SlaPolicy Objects](#slapolicy-objects) for more information.
 
 
 ### UpdateConfig Objects
@@ -374,6 +376,35 @@ Parameters for controlling the rate and
 | ```rollback_on_failure```    | boolean  | When False, prevents auto rollback 
of a failed update (Default: True)
 | ```wait_for_batch_completion```| boolean | When True, all threads from a 
given batch will be blocked from picking up new instances until the entire 
batch is updated. This essentially simulates the legacy sequential updater 
algorithm. (Default: False)
 | ```pulse_interval_secs```    | Integer  |  Indicates a [coordinated 
update](../../features/job-updates/#coordinated-job-updates). If no pulses are 
received within the provided interval the update will be blocked. Beta-updater 
only. Will fail on submission when used with client updater. (Default: None)
+| ```sla_aware```              | boolean  | When True, updates will only 
update an instance if it does not break the task's specified [SLA 
Requirements](../../features/sla-requirements/). (Default: None)
+
+#### Using the `sla_aware` option
+
+There are some nuances around the `sla_aware` option that users should be 
aware of:
+
+- SLA-aware updates work in tandem with maintenance. Draining a host that has 
an instance of the
+job being updated affects the SLA and thus will be taken into account when the 
update determines
+whether or not it is safe to update another instance.
+- SLA-aware updates will use the 
[SLAPolicy](../../features/sla-requirements/#custom-sla) of the
+*newest* configuration when determining whether or not it is safe to update an 
instance. For
+example, if the current configuration specifies a
+[PercentageSlaPolicy](../../features/sla-requirements/#percentageslapolicy-objects)
 that allows for
+5% of instances to be down and the updated configuration increases this value 
to 10%, the SLA
+calculation will be done using the 10% policy. Be mindful of this when doing 
an update that
+modifies the `SLAPolicy` since it may be possible to put the old configuration 
in a bad state
+that the new configuration would not be affected by. Additionally, if the 
update is rolled back,
+then the rollback will use the old `SLAPolicy` (or none if there was not one 
previously).
+- If using the 
[CoordinatorSlaPolicy](../../features/sla-requirements/#coordinatorslapolicy-objects),
+it is important to pay attention to the `batch_size` of the update. If you 
have a complex SLA
+requirement, then you may be limiting the throughput of your updates with an 
insufficient
+`batch_size`. For example, imagine you have a job with 9 instances that 
represent three
+replicated caches, and you can only update one instance per replica set: `[0 1 
2]
+[3 4 5] [6 7 8]` (the number indicates the instance ID and the brackets 
represent replica
+sets). If your `batch_size` is 3, then you will slowly update one replica set 
at a time. If your
+`batch_size` is 9, then you can update all replica sets in parallel and thus 
speeding up the update.
+- If an instance fails an SLA check for an update, then it will be rechecked starting at a delay from `sla_aware_kill_retry_min_delay` and exponentially increasing up to `sla_aware_kill_retry_max_delay`. These are cluster-operator-set values.
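+
+As a sketch, enabling SLA-aware updates is done through the job's `UpdateConfig` (the job fields and values below are illustrative placeholders, not part of this change):
+
+```python
+update_config = UpdateConfig(
+  batch_size = 3,
+  sla_aware = True  # check the task's SLAPolicy before updating each instance
+)
+
+job = Job(
+  name = 'example_service',   # hypothetical job name
+  role = 'www-data',          # hypothetical role
+  update_config = update_config,
+  sla_policy = PercentageSlaPolicy(
+    percentage = 95.0,        # at least 95% of instances must stay active
+    duration_secs = 1800      # a task must be RUNNING 30 min to count as active
+  ),
+  task = example_task         # assumed to be defined elsewhere
+)
+```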
 
 ### HealthCheckConfig Objects
 
@@ -564,7 +595,7 @@ See [Docker Command Line Reference](http
  ```graceful_shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
  ```shutdown_wait_secs```          | Integer | The amount of time (in seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
 
-#### graceful_shutdown_endpoint
+#### graceful\_shutdown\_endpoint
 
 If the Job is listening on the port as specified by the HttpLifecycleConfig
 (default: `health`), a HTTP POST request will be sent over localhost to this
@@ -581,6 +612,34 @@ does not shut down on its own after `shu
 forcefully killed.
 
 
+### SlaPolicy Objects
+
+Configuration for specifying custom [SLA requirements](../../features/sla-requirements/) for a job. There are 3 supported SLA policies, namely [`CountSlaPolicy`](#countslapolicy-objects), [`PercentageSlaPolicy`](#percentageslapolicy-objects) and [`CoordinatorSlaPolicy`](#coordinatorslapolicy-objects).
+
+
+### CountSlaPolicy Objects
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```count```                       | Integer | The number of active instances required every `duration_secs`.
+  ```duration_secs```               | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.
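+
+For example, a policy requiring at least 6 active instances over a 15-minute window could be sketched as follows (the values are illustrative, not from this change):
+
+```python
+sla_policy = CountSlaPolicy(
+  count = 6,            # at least 6 instances must be active
+  duration_secs = 900   # a task must be RUNNING 15 minutes to count as active
+)
+```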
+
+### PercentageSlaPolicy Objects
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```percentage```                  | Float   | The percentage of active instances required every `duration_secs`.
+  ```duration_secs```               | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.
+
+### CoordinatorSlaPolicy Objects
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```coordinator_url```             | String  | The URL to the [Coordinator](../../features/sla-requirements/#coordinator) service to be contacted before performing SLA-affecting actions (job updates, host drains etc.).
+  ```status_key```                  | String  | The field in the Coordinator response that indicates the SLA status for working on the task. (Default: `drain`)
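+
+As an illustration, a Coordinator reached at `coordinator_url` might reply with a JSON body such as the following, where the field named by `status_key` (here the default, `drain`) carries the verdict; this payload shape is an assumption, and the exact contract is described in the [Coordinator](../../features/sla-requirements/#coordinator) documentation:
+
+```json
+{
+  "drain": true
+}
+```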
+
+
 Specifying Scheduling Constraints
 =================================
 

Modified: aurora/site/source/documentation/latest/reference/scheduler-configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/reference/scheduler-configuration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/reference/scheduler-configuration.md (original)
+++ aurora/site/source/documentation/latest/reference/scheduler-configuration.md Tue Sep 11 05:25:44 2018
@@ -106,6 +106,8 @@ Optional flags:
        Minimum guaranteed time for task history retention before any pruning is attempted.
 -history_prune_threshold (default (2, days))
        Time after which the scheduler will prune terminated task history.
+-host_maintenance_polling_interval (default (1, minute))
+       Interval between polling for pending host maintenance requests.
 -hostname
        The hostname to advertise in ZooKeeper instead of the locally-resolved hostname.
 -http_authentication_mechanism (default NONE)
@@ -134,6 +136,8 @@ Optional flags:
        Maximum delay between attempts to schedule a flapping task.
 -max_leading_duration (default (1, days))
        After leading for this duration, the scheduler should commit suicide.
+-max_parallel_coordinated_maintenance (default 10)
+       Maximum number of coordinators that can be contacted in parallel.
 -max_registration_delay (default (1, mins))
        Max allowable delay to allow the driver to register before aborting
 -max_reschedule_task_delay_on_startup (default (30, secs))
@@ -144,6 +148,8 @@ Optional flags:
        Maximum number of scheduling attempts to make per second.
 -max_schedule_penalty (default (1, mins))
        Maximum delay between attempts to schedule PENDING tasks.
+-max_sla_duration_secs (default (2, hrs))
+       Maximum duration window for which SLA requirements are to be satisfied. This does not apply to jobs that have a CoordinatorSlaPolicy.
 -max_status_update_batch_size (default 1000) [must be > 0]
        The maximum number of status updates that can be processed in a batch.
 -max_task_event_batch_size (default 300) [must be > 0]
@@ -156,6 +162,8 @@ Optional flags:
        Upper limit on the number of failures allowed during a job update. This helps cap potentially unbounded entries into storage.
 -min_offer_hold_time (default (5, mins))
        Minimum amount of time to hold a resource offer before declining.
+-min_required_instances_for_sla_check (default 20)
+       Minimum number of instances required for a job to be eligible for SLA check. This does not apply to jobs that have a CoordinatorSlaPolicy.
 -native_log_election_retries (default 20)
        The maximum number of attempts to obtain a new log writer.
 -native_log_election_timeout (default (15, secs))
@@ -214,6 +222,14 @@ Optional flags:
        Path to shiro.ini for authentication and authorization configuration.
 -shiro_realm_modules (default [class org.apache.aurora.scheduler.http.api.security.IniShiroRealmModule])
        Guice modules for configuring Shiro Realms.
+-sla_aware_action_max_batch_size (default 300) [must be > 0]
+       The maximum number of SLA-aware update actions that can be processed in a batch.
+-sla_aware_kill_retry_min_delay (default (1, min)) [must be > 0]
+       The minimum amount of time to wait before retrying an SLA-aware kill (using a truncated binary backoff).
+-sla_aware_kill_retry_max_delay (default (5, min)) [must be > 0]
+       The maximum amount of time to wait before retrying an SLA-aware kill (using a truncated binary backoff).
+-sla_coordinator_timeout (default (1, min)) [must be > 0]
+       Timeout interval for communicating with the Coordinator.
 -sla_non_prod_metrics (default [])
        Metric categories collected for non production tasks.
 -sla_prod_metrics (default [JOB_UPTIMES, PLATFORM_UPTIME, MEDIANS])

