Added: 
aurora/site/source/documentation/0.18.1/reference/observer-configuration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.1/reference/observer-configuration.md?rev=1813982&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.1/reference/observer-configuration.md 
(added)
+++ aurora/site/source/documentation/0.18.1/reference/observer-configuration.md 
Wed Nov  1 18:39:52 2017
@@ -0,0 +1,89 @@
+# Observer Configuration Reference
+
+The Aurora/Thermos observer can take a variety of configuration options through command-line arguments.
+A list of the available options can be seen by running `thermos_observer --long-help`.
+
+Please refer to the [Operator Configuration Guide](../../operations/configuration/) for details on how to properly set the most important options.
+
+```
+$ thermos_observer.pex --long-help
+Options:
+  -h, --help, --short-help
+                        show this help message and exit.
+  --long-help           show options from all registered modules, not just the
+                        __main__ module.
+  --mesos-root=MESOS_ROOT
+                        The mesos root directory to search for Thermos
+                        executor sandboxes [default: /var/lib/mesos]
+  --ip=IP               The IP address the observer will bind to. [default:
+                        0.0.0.0]
+  --port=PORT           The port on which the observer should listen.
+                        [default: 1338]
+  --polling_interval_secs=POLLING_INTERVAL_SECS
+                        The number of seconds between observer refresh
+                        attempts. [default: 5]
+  --task_process_collection_interval_secs=TASK_PROCESS_COLLECTION_INTERVAL_SECS
+                        The number of seconds between per task process
+                        resource collections. [default: 20]
+  --task_disk_collection_interval_secs=TASK_DISK_COLLECTION_INTERVAL_SECS
+                        The number of seconds between per task disk resource
+                        collections. [default: 60]
+
+  From module twitter.common.app:
+    --app_daemonize     Daemonize this application. [default: False]
+    --app_profile_output=FILENAME
+                        Dump the profiling output to a binary profiling
+                        format. [default: None]
+    --app_daemon_stderr=TWITTER_COMMON_APP_DAEMON_STDERR
+                        Direct this app's stderr to this file if daemonized.
+                        [default: /dev/null]
+    --app_debug         Print extra debugging information during application
+                        initialization. [default: False]
+    --app_rc_filename   Print the filename for the rc file and quit. [default:
+                        False]
+    --app_daemon_stdout=TWITTER_COMMON_APP_DAEMON_STDOUT
+                        Direct this app's stdout to this file if daemonized.
+                        [default: /dev/null]
+    --app_profiling     Run profiler on the code while it runs.  Note this can
+                        cause slowdowns. [default: False]
+    --app_ignore_rc_file
+                        Ignore default arguments from the rc file. [default:
+                        False]
+    --app_pidfile=TWITTER_COMMON_APP_PIDFILE
+                        The pidfile to use if --app_daemonize is specified.
+                        [default: None]
+
+  From module twitter.common.log.options:
+    --log_to_stdout=[scheme:]LEVEL
+                        OBSOLETE - legacy flag, use --log_to_stderr instead.
+                        [default: ERROR]
+    --log_to_stderr=[scheme:]LEVEL
+                        The level at which logging to stderr [default: ERROR].
+                        Takes either LEVEL or scheme:LEVEL, where LEVEL is one
+                        of ['INFO', 'NONE', 'WARN', 'ERROR', 'DEBUG', 'FATAL']
+                        and scheme is one of ['google', 'plain'].
+    --log_to_disk=[scheme:]LEVEL
+                        The level at which logging to disk [default: INFO].
+                        Takes either LEVEL or scheme:LEVEL, where LEVEL is one
+                        of ['INFO', 'NONE', 'WARN', 'ERROR', 'DEBUG', 'FATAL']
+                        and scheme is one of ['google', 'plain'].
+    --log_dir=DIR       The directory into which log files will be generated
+                        [default: /var/tmp].
+    --log_simple        Write a single log file rather than one log file per
+                        log level [default: False].
+    --log_to_scribe=[scheme:]LEVEL
+                        The level at which logging to scribe [default: NONE].
+                        Takes either LEVEL or scheme:LEVEL, where LEVEL is one
+                        of ['INFO', 'NONE', 'WARN', 'ERROR', 'DEBUG', 'FATAL']
+                        and scheme is one of ['google', 'plain'].
+    --scribe_category=CATEGORY
+                        The category used when logging to the scribe daemon.
+                        [default: python_default].
+    --scribe_buffer     Buffer messages when scribe is unavailable rather than
+                        dropping them. [default: False].
+    --scribe_host=HOST  The host running the scribe daemon. [default:
+                        localhost].
+    --scribe_port=PORT  The port used to connect to the scribe daemon.
+                        [default: 1463].
+```
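
For example, an observer co-located with a Mesos agent could be started with a handful of these flags. The values below are illustrative (each happens to match the documented default), not recommendations:

```shell
# Illustrative launch: every value below matches the documented default,
# shown explicitly for clarity.
thermos_observer.pex \
  --mesos-root=/var/lib/mesos \
  --ip=0.0.0.0 \
  --port=1338 \
  --polling_interval_secs=5 \
  --task_process_collection_interval_secs=20 \
  --task_disk_collection_interval_secs=60
```

Lower collection intervals give fresher per-task stats at the cost of more frequent resource sampling on the agent.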

Added: 
aurora/site/source/documentation/0.18.1/reference/scheduler-configuration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.1/reference/scheduler-configuration.md?rev=1813982&view=auto
==============================================================================
--- 
aurora/site/source/documentation/0.18.1/reference/scheduler-configuration.md 
(added)
+++ 
aurora/site/source/documentation/0.18.1/reference/scheduler-configuration.md 
Wed Nov  1 18:39:52 2017
@@ -0,0 +1,268 @@
+# Scheduler Configuration Reference
+
+The Aurora scheduler can take a variety of configuration options through command-line arguments.
+A list of the available options can be seen by running `aurora-scheduler -help`.
+
+Please refer to the [Operator Configuration Guide](../../operations/configuration/) for details on how to properly set the most important options.
+
+```
+$ aurora-scheduler -help
+-------------------------------------------------------------------------
+-h or -help to print this help message
+
+Required flags:
+-backup_dir [not null]
+       Directory to store backups under. Will be created if it does not exist.
+-cluster_name [not null]
+       Name to identify the cluster being served.
+-db_max_active_connection_count [must be > 0]
+       Max number of connections to use with database via MyBatis
+-db_max_idle_connection_count [must be > 0]
+       Max number of idle connections to the database via MyBatis
+-framework_authentication_file
+       Properties file which contains framework credentials to authenticate with the Mesos master. Must contain the properties 'aurora_authentication_principal' and 'aurora_authentication_secret'.
+-ip
+       The IP address to listen on. If not set, the scheduler will listen on all interfaces.
+-mesos_master_address [not null]
+       Address for the mesos master, can be a socket address or zookeeper path.
+-mesos_role
+       The Mesos role this framework will register as. The default is to leave this empty, in which case the framework registers without any role and only receives unreserved resources in offers.
+-serverset_path [not null, must be non-empty]
+       ZooKeeper ServerSet path to register at.
+-shiro_after_auth_filter
+       Fully qualified class name of the servlet filter to be applied after 
the shiro auth filters are applied.
+-thermos_executor_path
+       Path to the thermos executor entry point.
+-tier_config [file must be readable]
+       Configuration file defining supported task tiers, task traits and 
behaviors.
+-webhook_config [file must exist, file must be readable]
+       Path to webhook configuration file.
+-zk_endpoints [must have at least 1 item]
+       Endpoint specification for the ZooKeeper servers.
+
+Optional flags:
+-allow_container_volumes (default false)
+       Allow passing in volumes in the job. Enabling this could pose a 
privilege escalation threat.
+-allow_docker_parameters (default false)
+       Allow passing docker container parameters in the job.
+-allow_gpu_resource (default false)
+       Allow jobs to request Mesos GPU resource.
+-allowed_container_types (default [MESOS])
+       Container types that are allowed to be used by jobs.
+-async_slot_stat_update_interval (default (1, mins))
+       Interval on which to try to update open slot stats.
+-async_task_stat_update_interval (default (1, hrs))
+       Interval on which to try to update resource consumption stats.
+-async_worker_threads (default 8)
+       The number of worker threads to process async task operations with.
+-backup_interval (default (1, hrs))
+       Minimum interval on which to write a storage backup.
+-cron_scheduler_num_threads (default 10)
+       Number of threads to use for the cron scheduler thread pool.
+-cron_scheduling_max_batch_size (default 10) [must be > 0]
+       The maximum number of triggered cron jobs that can be processed in a 
batch.
+-cron_start_initial_backoff (default (5, secs))
+       Initial backoff delay while waiting for a previous cron run to be 
killed.
+-cron_start_max_backoff (default (1, mins))
+       Max backoff delay while waiting for a previous cron run to be killed.
+-cron_timezone (default GMT)
+       TimeZone to use for cron predictions.
+-custom_executor_config [file must exist, file must be readable]
+       Path to custom executor settings configuration file.
+-db_lock_timeout (default (1, mins))
+       H2 table lock timeout
+-db_row_gc_interval (default (2, hrs))
+       Interval on which to scan the database for unused row references.
+-default_docker_parameters (default {})
+       Default docker parameters for any job that does not explicitly declare 
parameters.
+-dlog_max_entry_size (default (512, KB))
+       Specifies the maximum entry size to append to the log. Larger entries 
will be split across entry Frames.
+-dlog_shutdown_grace_period (default (2, secs))
+       Specifies the maximum time to wait for scheduled checkpoint and 
snapshot actions to complete before forcibly shutting down.
+-dlog_snapshot_interval (default (1, hrs))
+       Specifies the frequency at which snapshots of local storage are taken 
and written to the log.
+-enable_cors_for
+       List of domains for which CORS support should be enabled.
+-enable_db_metrics (default true)
+       Whether to use MyBatis interceptor to measure the timing of intercepted 
Statements.
+-enable_h2_console (default false)
+       Enable H2 DB management console.
+-enable_mesos_fetcher (default false)
+       Allow jobs to pass URIs to the Mesos Fetcher. Note that enabling this 
feature could pose a privilege escalation threat.
+-enable_preemptor (default true)
+       Enable the preemptor and preemption
+-enable_revocable_cpus (default true)
+       Treat CPUs as a revocable resource.
+-enable_revocable_ram (default false)
+       Treat RAM as a revocable resource.
+-executor_user (default root)
+       User to start the executor. Defaults to "root". Set this to an 
unprivileged user if the mesos master was started with "--no-root_submissions". 
If set to anything other than "root", the executor will ignore the "role" 
setting for jobs since it can't use setuid() anymore. This means that all your 
jobs will run under the specified user and the user has to exist on the Mesos 
agents.
+-first_schedule_delay (default (1, ms))
+       Initial amount of time to wait before first attempting to schedule a 
PENDING task.
+-flapping_task_threshold (default (5, mins))
+       A task that repeatedly runs for less than this time is considered to be 
flapping.
+-framework_announce_principal (default false)
+       When 'framework_authentication_file' flag is set, the FrameworkInfo 
registered with the mesos master will also contain the principal. This is 
necessary if you intend to use mesos authorization via mesos ACLs. The default 
will change in a future release. Changing this value is backwards incompatible. 
For details, see MESOS-703.
+-framework_failover_timeout (default (21, days))
+       Time after which a framework is considered deleted.  SHOULD BE VERY 
HIGH.
+-framework_name (default Aurora)
+       Name used to register the Aurora framework with Mesos.
+-global_container_mounts (default [])
+       A comma separated list of mount points (in host:container form) to 
mount into all (non-mesos) containers.
+-history_max_per_job_threshold (default 100)
+       Maximum number of terminated tasks to retain in a job history.
+-history_min_retention_threshold (default (1, hrs))
+       Minimum guaranteed time for task history retention before any pruning 
is attempted.
+-history_prune_threshold (default (2, days))
+       Time after which the scheduler will prune terminated task history.
+-hostname
+       The hostname to advertise in ZooKeeper instead of the locally-resolved 
hostname.
+-http_authentication_mechanism (default NONE)
+       HTTP Authentication mechanism to use.
+-http_port (default 0)
+       The port to start an HTTP server on.  Default value will choose a 
random port.
+-initial_flapping_task_delay (default (30, secs))
+       Initial amount of time to wait before attempting to schedule a flapping 
task.
+-initial_schedule_penalty (default (1, secs))
+       Initial amount of time to wait before attempting to schedule a task 
that has failed to schedule.
+-initial_task_kill_retry_interval (default (5, secs))
+       When killing a task, retry after this delay if mesos has not responded, 
backing off up to transient_task_state_timeout
+-job_update_history_per_job_threshold (default 10)
+       Maximum number of completed job updates to retain in a job update 
history.
+-job_update_history_pruning_interval (default (15, mins))
+       Job update history pruning interval.
+-job_update_history_pruning_threshold (default (30, days))
+       Time after which the scheduler will prune completed job update history.
+-kerberos_debug (default false)
+       Produce additional Kerberos debugging output.
+-kerberos_server_keytab
+       Path to the server keytab.
+-kerberos_server_principal
+       Kerberos server principal to use, usually of the form 
HTTP/[email protected]
+-max_flapping_task_delay (default (5, mins))
+       Maximum delay between attempts to schedule a flapping task.
+-max_leading_duration (default (1, days))
+       After leading for this duration, the scheduler should commit suicide.
+-max_registration_delay (default (1, mins))
+       Max allowable delay to allow the driver to register before aborting
+-max_reschedule_task_delay_on_startup (default (30, secs))
+       Upper bound of random delay for pending task rescheduling on scheduler 
startup.
+-max_saved_backups (default 48)
+       Maximum number of backups to retain before deleting the oldest backups.
+-max_schedule_attempts_per_sec (default 40.0)
+       Maximum number of scheduling attempts to make per second.
+-max_schedule_penalty (default (1, mins))
+       Maximum delay between attempts to schedule a PENDING task.
+-max_status_update_batch_size (default 1000) [must be > 0]
+       The maximum number of status updates that can be processed in a batch.
+-max_task_event_batch_size (default 300) [must be > 0]
+       The maximum number of task state change events that can be processed in 
a batch.
+-max_tasks_per_job (default 4000) [must be > 0]
+       Maximum number of allowed tasks in a single job.
+-max_tasks_per_schedule_attempt (default 5) [must be > 0]
+       The maximum number of tasks to pick in a single scheduling attempt.
+-max_update_instance_failures (default 20000) [must be > 0]
+       Upper limit on the number of failures allowed during a job update. This 
helps cap potentially unbounded entries into storage.
+-min_offer_hold_time (default (5, mins))
+       Minimum amount of time to hold a resource offer before declining.
+-native_log_election_retries (default 20)
+       The maximum number of attempts to obtain a new log writer.
+-native_log_election_timeout (default (15, secs))
+       The timeout for a single attempt to obtain a new log writer.
+-native_log_file_path
+       Path to a file to store the native log data in.  If the parent directory does not exist it will be created.
+-native_log_quorum_size (default 1)
+       The size of the quorum required for all log mutations.
+-native_log_read_timeout (default (5, secs))
+       The timeout for doing log reads.
+-native_log_write_timeout (default (3, secs))
+       The timeout for doing log appends and truncations.
+-native_log_zk_group_path
+       A zookeeper node for use by the native log to track the master 
coordinator.
+-offer_filter_duration (default (5, secs))
+       Duration after which we expect Mesos to re-offer unused resources. A 
short duration improves scheduling performance in smaller clusters, but might 
lead to resource starvation for other frameworks if you run many frameworks in 
your cluster.
+-offer_hold_jitter_window (default (1, mins))
+       Maximum amount of random jitter to add to the offer hold time window.
+-offer_reservation_duration (default (3, mins))
+       Time to reserve an agent's offers while trying to satisfy a task preempting another.
+-populate_discovery_info (default false)
+       If true, Aurora populates DiscoveryInfo field of Mesos TaskInfo.
+-preemption_delay (default (3, mins))
+       Time interval after which a pending task becomes eligible to preempt 
other tasks
+-preemption_slot_finder_modules (default [class 
org.apache.aurora.scheduler.preemptor.PendingTaskProcessorModule, class 
org.apache.aurora.scheduler.preemptor.PreemptionVictimFilterModule])
+  Guice modules for replacing preemption logic.
+-preemption_slot_hold_time (default (5, mins))
+       Time to hold a preemption slot found before it is discarded.
+-preemption_slot_search_interval (default (1, mins))
+       Time interval between pending task preemption slot searches.
+-receive_revocable_resources (default false)
+       Allows receiving revocable resource offers from Mesos.
+-reconciliation_explicit_batch_interval (default (5, secs))
+       Interval between explicit batch reconciliation requests.
+-reconciliation_explicit_batch_size (default 1000) [must be > 0]
+       Number of tasks in a single batch request sent to Mesos for explicit 
reconciliation.
+-reconciliation_explicit_interval (default (60, mins))
+       Interval on which scheduler will ask Mesos for status updates of all 
non-terminal tasks known to scheduler.
+-reconciliation_implicit_interval (default (60, mins))
+       Interval on which scheduler will ask Mesos for status updates of all 
non-terminal tasks known to Mesos.
+-reconciliation_initial_delay (default (1, mins))
+       Initial amount of time to delay task reconciliation after scheduler 
start up.
+-reconciliation_schedule_spread (default (30, mins))
+       Difference between explicit and implicit reconciliation intervals 
intended to create a non-overlapping task reconciliation schedule.
+-require_docker_use_executor (default true)
+       If false, Docker tasks may run without an executor (EXPERIMENTAL)
+-scheduling_max_batch_size (default 3) [must be > 0]
+       The maximum number of scheduling attempts that can be processed in a 
batch.
+-serverset_endpoint_name (default http)
+       Name of the scheduler endpoint published in ZooKeeper.
+-shiro_ini_path
+       Path to shiro.ini for authentication and authorization configuration.
+-shiro_realm_modules (default [class 
org.apache.aurora.scheduler.http.api.security.IniShiroRealmModule])
+       Guice modules for configuring Shiro Realms.
+-sla_non_prod_metrics (default [])
+       Metric categories collected for non production tasks.
+-sla_prod_metrics (default [JOB_UPTIMES, PLATFORM_UPTIME, MEDIANS])
+       Metric categories collected for production tasks.
+-sla_stat_refresh_interval (default (1, mins))
+       The SLA stat refresh interval.
+-slow_query_log_threshold (default (25, ms))
+       Log all queries that take at least this long to execute.
+-snapshot_hydrate_stores (default [locks, hosts, quota, job_updates])
+       Which H2-backed stores to fully hydrate on the Snapshot.
+-stat_retention_period (default (1, hrs))
+       Time for a stat to be retained in memory before expiring.
+-stat_sampling_interval (default (1, secs))
+       Statistic value sampling interval.
+-task_assigner_modules (default [class 
org.apache.aurora.scheduler.state.FirstFitTaskAssignerModule])
+  Guice modules for replacing task assignment logic.
+-thermos_executor_cpu (default 0.25)
+       The number of CPU cores to allocate for each instance of the executor.
+-thermos_executor_flags
+       Extra arguments to be passed to the thermos executor
+-thermos_executor_ram (default (128, MB))
+       The amount of RAM to allocate for each instance of the executor.
+-thermos_executor_resources (default [])
+       A comma separated list of additional resources to copy into the sandbox. Note: if thermos_executor_path is not the thermos_executor.pex file itself, this must include it.
+-thermos_home_in_sandbox (default false)
+       If true, changes HOME to the sandbox before running the executor. This 
primarily has the effect of causing the executor and runner to extract 
themselves into the sandbox.
+-transient_task_state_timeout (default (5, mins))
+       The amount of time after which to treat a task stuck in a transient 
state as LOST.
+-use_beta_db_task_store (default false)
+       Whether to use the experimental database-backed task store.
+-viz_job_url_prefix (default )
+       URL prefix for job container stats.
+-zk_chroot_path
+       chroot path to use for the ZooKeeper connections
+-zk_digest_credentials
+       user:password to use when authenticating with ZooKeeper.
+-zk_in_proc (default false)
+       Launches an embedded zookeeper server for local testing causing 
-zk_endpoints to be ignored if specified.
+-zk_session_timeout (default (4, secs))
+       The ZooKeeper session timeout.
+-zk_use_curator (default true)
+       DEPRECATED: Uses Apache Curator as the zookeeper client; otherwise a 
copy of Twitter commons/zookeeper (the legacy library) is used.
+-------------------------------------------------------------------------
+```
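
As a minimal sketch, a scheduler might be launched by supplying the required flags listed above together with the replicated-log settings. Every value below is a placeholder for your environment, not a recommendation:

```shell
# Minimal launch sketch: all hostnames and paths are placeholders.
aurora-scheduler \
  -cluster_name=example \
  -backup_dir=/var/lib/aurora/backups \
  -mesos_master_address=zk://zk1.example.com:2181/mesos \
  -zk_endpoints=zk1.example.com:2181 \
  -serverset_path=/aurora/scheduler \
  -native_log_file_path=/var/lib/aurora/db \
  -native_log_zk_group_path=/aurora/replicated-log \
  -native_log_quorum_size=1 \
  -thermos_executor_path=/usr/share/aurora/thermos_executor.pex
```

In a production deployment `-native_log_quorum_size` should be a majority of the scheduler instances (e.g. 2 for a 3-scheduler cluster); the Operator Configuration Guide covers this in detail.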

Added: aurora/site/source/documentation/0.18.1/reference/scheduler-endpoints.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.1/reference/scheduler-endpoints.md?rev=1813982&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.1/reference/scheduler-endpoints.md 
(added)
+++ aurora/site/source/documentation/0.18.1/reference/scheduler-endpoints.md 
Wed Nov  1 18:39:52 2017
@@ -0,0 +1,19 @@
+# HTTP endpoints
+
+There are a number of HTTP endpoints that the Aurora scheduler exposes. These allow various
+operational tasks to be performed on the scheduler. Below is an (incomplete) list of such endpoints
+and a brief explanation of what they do.
+
+## Leader health
+The /leaderhealth endpoint enables performing health checks on the scheduler instances in order
+to forward requests to the leading scheduler. This is typically used by a load balancer such as
+HAProxy or AWS ELB.
+
+When an HTTP GET request is issued on this endpoint, it responds as follows:
+- If the instance that received the GET request is the leading scheduler, an HTTP status code of
+  `200 OK` is returned.
+- If the instance that received the GET request is not the leading scheduler but a leader does
+  exist, an HTTP status code of `503 SERVICE_UNAVAILABLE` is returned.
+- If no leader currently exists or the leader is unknown, an HTTP status code of `502 BAD_GATEWAY`
+  is returned.
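
A probe can be approximated with `curl`; the helper below maps the response code to the states described above (the function name and hostname are ours, not part of Aurora):

```shell
# Map a /leaderhealth HTTP status code to the scheduler's leadership state.
leader_state() {
  case "$1" in
    200) echo "leader" ;;
    503) echo "follower" ;;   # a leader exists, but it is another instance
    502) echo "no-leader" ;;  # no leader elected, or leader unknown
    *)   echo "unexpected" ;;
  esac
}

# Probe one scheduler instance (hostname/port are placeholders):
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   http://scheduler1.example.com:8081/leaderhealth)
# leader_state "$code"
```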

Added: aurora/site/source/documentation/0.18.1/reference/task-lifecycle.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.1/reference/task-lifecycle.md?rev=1813982&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.1/reference/task-lifecycle.md (added)
+++ aurora/site/source/documentation/0.18.1/reference/task-lifecycle.md Wed Nov 
 1 18:39:52 2017
@@ -0,0 +1,148 @@
+# Task Lifecycle
+
+When Aurora reads a configuration file and finds a `Job` definition, it:
+
+1.  Evaluates the `Job` definition.
+2.  Splits the `Job` into its constituent `Task`s.
+3.  Sends those `Task`s to the scheduler.
+4.  The scheduler puts the `Task`s into `PENDING` state, starting each
+    `Task`'s life cycle.
+
+
+![Life of a task](../images/lifeofatask.png)
+
+Please note, a couple of task states described below are missing from
+this state diagram.
+
+
+## PENDING to RUNNING states
+
+When a `Task` is in the `PENDING` state, the scheduler constantly
+searches for machines satisfying that `Task`'s resource request
+requirements (RAM, disk space, CPU time) while maintaining configuration
+constraints such as "a `Task` must run on machines dedicated to a
+particular role" or attribute limit constraints such as "at most 2
+`Task`s from the same `Job` may run on each rack". When the scheduler
+finds a suitable match, it assigns the `Task` to a machine and puts the
+`Task` into the `ASSIGNED` state.
+
+From the `ASSIGNED` state, the scheduler sends an RPC to the agent
+machine containing `Task` configuration, which the agent uses to spawn
+an executor responsible for the `Task`'s lifecycle. When the scheduler
+receives an acknowledgment that the machine has accepted the `Task`,
+the `Task` goes into `STARTING` state.
+
+`STARTING` state initializes a `Task` sandbox. When the sandbox is fully
+initialized, Thermos begins to invoke `Process`es. The agent machine
+sends an update to the scheduler that the `Task` is in `RUNNING` state,
+but only once the task satisfies its liveness requirements.
+See [Health Checking](../features/services#health-checking) for details
+on how to configure health checks.
+
+
+
+## RUNNING to terminal states
+
+There are various ways that an active `Task` can transition into a terminal
+state. By definition, it can never leave that state. However, depending on
+the nature of the termination and the originating `Job` definition
+(e.g. `service`, `max_task_failures`), a replacement `Task` might be
+scheduled.
+
+### Natural Termination: FINISHED, FAILED
+
+A `RUNNING` `Task` can terminate without direct user interaction. For
+example, it may be a finite computation that finishes, even something as
+simple as `echo hello world`, or it could be an exceptional condition in
+a long-lived service. If the `Task` is successful (its underlying
+processes have succeeded with exit status `0` or finished without
+reaching failure limits) it moves into `FINISHED` state. If it finished
+after reaching a set of failure limits, it goes into `FAILED` state.
+
+A terminated `Task` which is subject to rescheduling will be temporarily
+`THROTTLED` if it is considered to be flapping. A task is flapping if its
+previous invocation was terminated after less than 5 minutes (scheduler
+default). The time penalty a task has to remain in the `THROTTLED` state
+before it is eligible for rescheduling increases with each consecutive
+failure.
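
The growing penalty can be pictured as an escalating backoff. The sketch below is illustrative only: the doubling policy is our assumption, not the scheduler's exact formula, though the start and cap correspond to the `-initial_flapping_task_delay` (30 secs) and `-max_flapping_task_delay` (5 mins) scheduler defaults:

```shell
# Illustrative sketch only: the scheduler's real policy may differ in shape.
# Assumes the penalty doubles per consecutive failure, capped at a maximum.
throttle_penalty_secs() {
  failures=$1
  delay=30   # -initial_flapping_task_delay default, in seconds
  max=300    # -max_flapping_task_delay default, in seconds
  i=1
  while [ "$i" -lt "$failures" ]; do
    delay=$((delay * 2))
    if [ "$delay" -gt "$max" ]; then delay=$max; fi
    i=$((i + 1))
  done
  echo "$delay"
}
```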
+
+### Forceful Termination: KILLING, RESTARTING
+
+You can terminate a `Task` by issuing an `aurora job kill` command, which
+moves it into `KILLING` state. The scheduler then sends the agent a
+request to terminate the `Task`. If the scheduler receives a successful
+response, it moves the Task into `KILLED` state and never restarts it.
+
+If a `Task` is forced into the `RESTARTING` state via the `aurora job restart`
+command, the scheduler kills the underlying task but in parallel schedules
+an identical replacement for it.
+
+In any case, the responsible executor on the agent follows an escalation
+sequence when killing a running task:
+
+  1. If a `HttpLifecycleConfig` is not present, skip to (4).
+  2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds.
+  3. Send a POST to the `shutdown_endpoint` and wait 5 seconds.
+  4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds.
+  5. Send SIGKILL (`kill -9`).
+
+If the executor notices that all `Process`es in a `Task` have aborted
+during this sequence, it will not proceed with subsequent steps.
+Note that graceful shutdown is best-effort, and due to the many
+inevitable realities of distributed systems, it may not be performed.
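
The escalation above can be sketched as follows; the guard flag and `finalization_wait` parameter are placeholders for what the executor derives from the task's `HttpLifecycleConfig` and configuration:

```shell
# Sketch of the escalation order (not the actual executor code): prints the
# steps the executor would attempt. The real executor stops early as soon as
# all of the task's processes have exited.
escalation_plan() {
  has_http_lifecycle=$1   # "true" if the task defines an HttpLifecycleConfig
  finalization_wait=$2    # seconds, from the task's finalization_wait setting
  if [ "$has_http_lifecycle" = "true" ]; then
    echo "POST graceful_shutdown_endpoint; wait 5s"
    echo "POST shutdown_endpoint; wait 5s"
  fi
  echo "SIGTERM; wait up to ${finalization_wait}s"
  echo "SIGKILL"
}
```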
+
+### Unexpected Termination: LOST
+
+If a `Task` stays in a transient task state for too long (such as `ASSIGNED`
+or `STARTING`), the scheduler forces it into `LOST` state, creating a new
+`Task` in its place that's sent into `PENDING` state.
+
+In addition, if the Mesos core tells the scheduler that an agent has
+become unhealthy (or outright disappeared), the `Task`s assigned to that
+agent go into `LOST` state and new `Task`s are created in their place.
+From `PENDING` state, there is no guarantee a `Task` will be reassigned
+to the same machine unless job constraints explicitly force it there.
+
+### Giving Priority to Production Tasks: PREEMPTING
+
+Sometimes a Task needs to be interrupted, such as when a non-production
+Task's resources are needed by a higher priority production Task. This
+type of interruption is called *preemption*. When this happens in
+Aurora, the non-production Task is killed and moved into
+the `PREEMPTING` state when both of the following are true:
+
+- The task being killed is a non-production task.
+- The other task is a `PENDING` production task that hasn't been
+  scheduled due to a lack of resources.
+
+The scheduler UI shows the non-production task was preempted in favor of
+the production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
+
+Note that non-production tasks consuming many resources are likely to be
+preempted in favor of production tasks.
+
+### Making Room for Maintenance: DRAINING
+
+Cluster operators can set an agent into maintenance mode. This will transition
+all `Task`s running on this agent into `DRAINING` and eventually to `KILLED`.
+Drained `Task`s will be restarted on other agents for which no maintenance
+has been announced yet.
+
+
+
+## State Reconciliation
+
+Due to the many inevitable realities of distributed systems, there might
+be a mismatch of perceived and actual cluster state (e.g. a machine returns
+from a `netsplit` but the scheduler has already marked all its `Task`s as
+`LOST` and rescheduled them).
+
+Aurora regularly runs a state reconciliation process in order to detect
+and correct such issues (e.g. by killing the errant `RUNNING` tasks).
+By default, the proper detection of all failure scenarios and inconsistencies
+may take up to an hour.
+
+To emphasize this point: there is no uniqueness guarantee for a single
+instance of a job in the presence of network partitions. If the `Task`
+requires that, it should be baked in at the application level using a
+distributed coordination service such as ZooKeeper.

Modified: aurora/site/source/documentation/latest/development/db-migration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/db-migration.md?rev=1813982&r1=1813981&r2=1813982&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/db-migration.md 
(original)
+++ aurora/site/source/documentation/latest/development/db-migration.md Wed Nov 
 1 18:39:52 2017
@@ -14,7 +14,7 @@ When adding or altering tables or changi
 
[schema.sql](../../src/main/resources/org/apache/aurora/scheduler/storage/db/schema.sql),
 a new
 migration class should be created under the 
org.apache.aurora.scheduler.storage.db.migration
 package. The class should implement the 
[MigrationScript](https://github.com/mybatis/migrations/blob/master/src/main/java/org/apache/ibatis/migration/MigrationScript.java)
-interface (see 
[V001_TestMigration](https://github.com/apache/aurora/blob/rel/0.18.0/src/test/java/org/apache/aurora/scheduler/storage/db/testmigration/V001_TestMigration.java)
+interface (see 
[V001_TestMigration](https://github.com/apache/aurora/blob/rel/0.18.1/src/test/java/org/apache/aurora/scheduler/storage/db/testmigration/V001_TestMigration.java)
 as an example). The upgrade and downgrade scripts are defined in this class. 
When restoring a
 snapshot the list of migrations on the classpath is compared to the list of 
applied changes in the
 DB. Any changes that have not yet been applied are executed and their 
downgrade script is stored

Modified: aurora/site/source/documentation/latest/development/thrift.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/thrift.md?rev=1813982&r1=1813981&r2=1813982&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/thrift.md (original)
+++ aurora/site/source/documentation/latest/development/thrift.md Wed Nov  1 
18:39:52 2017
@@ -6,7 +6,7 @@ client/server RPC protocol as well as fo
 correctly handling additions and renames of the existing members, field 
removals must be done
 carefully to ensure backwards compatibility and provide predictable 
deprecation cycle. This
 document describes general guidelines for making Thrift schema changes to the 
existing fields in
-[api.thrift](https://github.com/apache/aurora/blob/rel/0.18.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+[api.thrift](https://github.com/apache/aurora/blob/rel/0.18.1/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
 
 It is highly recommended to go through the
 [Thrift: The Missing 
Guide](http://diwakergupta.github.io/thrift-missing-guide/) first to refresh on
@@ -33,7 +33,7 @@ communicate with scheduler/client from v
 * Add a new field as an eventual replacement of the old one and implement a 
dual read/write
 anywhere the old field is used. If a thrift struct is mapped in the DB store 
make sure both columns
 are marked as `NOT NULL`
-* Check 
[storage.thrift](https://github.com/apache/aurora/blob/rel/0.18.0/api/src/main/thrift/org/apache/aurora/gen/storage.thrift)
 to see if
+* Check 
[storage.thrift](https://github.com/apache/aurora/blob/rel/0.18.1/api/src/main/thrift/org/apache/aurora/gen/storage.thrift)
 to see if
 the affected struct is stored in Aurora scheduler storage. If so, it's almost 
certainly also
 necessary to perform a [DB migration](../db-migration/).
 * Add a deprecation jira ticket into the vCurrent+1 release candidate

Modified: aurora/site/source/documentation/latest/features/job-updates.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/features/job-updates.md?rev=1813982&r1=1813981&r2=1813982&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/features/job-updates.md (original)
+++ aurora/site/source/documentation/latest/features/job-updates.md Wed Nov  1 
18:39:52 2017
@@ -70,7 +70,7 @@ acknowledging ("heartbeating") job updat
 service updates where explicit job health monitoring is vital during the 
entire job update
 lifecycle. Such job updates would rely on an external service (or a custom 
client) periodically
 pulsing an active coordinated job update via a
-[pulseJobUpdate 
RPC](https://github.com/apache/aurora/blob/rel/0.18.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+[pulseJobUpdate 
RPC](https://github.com/apache/aurora/blob/rel/0.18.1/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
 
 A coordinated update is defined by setting a positive
 [pulse_interval_secs](../../reference/configuration/#updateconfig-objects) 
value in job configuration

Modified: aurora/site/source/documentation/latest/features/sla-metrics.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/features/sla-metrics.md?rev=1813982&r1=1813981&r2=1813982&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/features/sla-metrics.md (original)
+++ aurora/site/source/documentation/latest/features/sla-metrics.md Wed Nov  1 
18:39:52 2017
@@ -63,7 +63,7 @@ relevant to uptime calculations. By appl
 transition records, we can build a deterministic downtime trace for every 
given service instance.
 
 A task going through a state transition carries one of three possible SLA 
meanings
-(see 
[SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.18.0/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java)
 for
+(see 
[SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.18.1/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java)
 for
 sla-to-task-state mapping):
 
 * Task is UP: starts a period where the task is considered to be up and 
running from the Aurora
@@ -110,7 +110,7 @@ metric that helps track the dependency o
 * Per job - `sla_<job_key>_mtta_ms`
 * Per cluster - `sla_cluster_mtta_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are 
defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.18.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.18.1/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtta_ms`
     * `sla_cpu_medium_mtta_ms`
@@ -147,7 +147,7 @@ for a task.*
 * Per job - `sla_<job_key>_mtts_ms`
 * Per cluster - `sla_cluster_mtts_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are 
defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.18.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.18.1/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtts_ms`
     * `sla_cpu_medium_mtts_ms`
@@ -182,7 +182,7 @@ reflecting on the overall time it takes
 * Per job - `sla_<job_key>_mttr_ms`
 * Per cluster - `sla_cluster_mttr_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are 
defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.18.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.18.1/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mttr_ms`
     * `sla_cpu_medium_mttr_ms`

Modified: aurora/site/source/documentation/latest/operations/configuration.md
URL: 
http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/configuration.md?rev=1813982&r1=1813981&r2=1813982&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/operations/configuration.md 
(original)
+++ aurora/site/source/documentation/latest/operations/configuration.md Wed Nov 
 1 18:39:52 2017
@@ -104,7 +104,7 @@ can furthermore help with storage perfor
 ### `-native_log_zk_group_path`
 ZooKeeper path used for Mesos replicated log quorum discovery.
 
-See 
[code](https://github.com/apache/aurora/blob/rel/0.18.0/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java)
 for
+See 
[code](https://github.com/apache/aurora/blob/rel/0.18.1/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java)
 for
 other available Mesos replicated log configuration options and default values.
 
 ### Changing the Quorum Size
@@ -167,7 +167,7 @@ the latter needs to be enabled via:
 
     -enable_revocable_ram=true
 
-Unless you want to use the 
[default](https://github.com/apache/aurora/blob/rel/0.18.0/src/main/resources/org/apache/aurora/scheduler/tiers.json)
+Unless you want to use the 
[default](https://github.com/apache/aurora/blob/rel/0.18.1/src/main/resources/org/apache/aurora/scheduler/tiers.json)
 tier configuration, you will also have to specify a file path:
 
     -tier_config=path/to/tiers/config.json

