Repository: aurora Updated Branches: refs/heads/master 6cb2d4f69 -> c85bffdd6
Extend operator documentation Included changes: * new cluster upgrade instructions * docs for several best practices collected on the mailinglist * extracted and extended troubleshooting guide for new cluster operators * several minor formatting fixes Reviewed at https://reviews.apache.org/r/58651/ Project: http://git-wip-us.apache.org/repos/asf/aurora/repo Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/c85bffdd Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/c85bffdd Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/c85bffdd Branch: refs/heads/master Commit: c85bffdd6f68312261697eee868d57069adda434 Parents: 6cb2d4f Author: Stephan Erb <[email protected]> Authored: Tue Apr 25 23:26:43 2017 +0200 Committer: Stephan Erb <[email protected]> Committed: Tue Apr 25 23:26:43 2017 +0200 ---------------------------------------------------------------------- docs/README.md | 3 + docs/features/custom-executors.md | 15 ++-- docs/features/webhooks.md | 2 +- docs/operations/backup-restore.md | 10 +-- docs/operations/configuration.md | 63 +++++++++++++++-- docs/operations/installation.md | 70 ++----------------- docs/operations/storage.md | 7 +- docs/operations/troubleshooting.md | 106 +++++++++++++++++++++++++++++ docs/operations/upgrades.md | 41 +++++++++++ docs/reference/scheduler-endpoints.md | 10 +-- 10 files changed, 237 insertions(+), 90 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/README.md ---------------------------------------------------------------------- diff --git a/docs/README.md b/docs/README.md index dfd3a23..166bf1c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -35,6 +35,8 @@ For those that wish to manage and fine-tune an Aurora cluster. 
* [Installation](operations/installation.md) * [Configuration](operations/configuration.md) + * [Upgrades](operations/upgrades.md) + * [Troubleshooting](operations/troubleshooting.md) * [Monitoring](operations/monitoring.md) * [Security](operations/security.md) * [Storage](operations/storage.md) @@ -55,6 +57,7 @@ The complete reference of commands, configuration options, and scheduler interna - [Client Cluster Configuration](reference/client-cluster-configuration.md) * [Scheduler Configuration](reference/scheduler-configuration.md) * [Observer Configuration](reference/observer-configuration.md) + * [Endpoints](reference/scheduler-endpoints.md) ## Additional Resources * [Tools integrating with Aurora](additional-resources/tools.md) http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/features/custom-executors.md ---------------------------------------------------------------------- diff --git a/docs/features/custom-executors.md b/docs/features/custom-executors.md index 40fc118..1357c1e 100644 --- a/docs/features/custom-executors.md +++ b/docs/features/custom-executors.md @@ -36,6 +36,7 @@ uris (optional) | List of resources to download into the task sandbox. shell (optional) | Run executor via shell. A note on the command property (from [mesos.proto](https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto)): + ``` 1) If 'shell == true', the command will be launched via shell (i.e., /bin/sh -c 'value'). The 'value' specified will be @@ -68,14 +69,15 @@ scalar (required) | Value in float for cpus or int for mem (in MBs) ### volume_mounts (list) -Property | Description ------------------------- | --------------------------------- -host_path (required) | Host path to mount inside the container. -container_path (required) | Path inside the container where `host_path` will be mounted. -mode (required) | Mode in which to mount the volume, Read-Write (RW) or Read-Only (RO). 
+Property | Description +--------------------------- | --------------------------------- +host_path (required) | Host path to mount inside the container. +container_path (required) | Path inside the container where `host_path` will be mounted. +mode (required) | Mode in which to mount the volume, Read-Write (RW) or Read-Only (RO). A sample configuration is as follows: -``` + +```json [ { "executor": { @@ -135,7 +137,6 @@ A sample configuration is as follows: "task_prefix": "my-executor-" } ] - ``` It should be noted that if you do not use Thermos or a Thermos based executor, links in the scheduler's http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/features/webhooks.md ---------------------------------------------------------------------- diff --git a/docs/features/webhooks.md b/docs/features/webhooks.md index 075aeec..a060975 100644 --- a/docs/features/webhooks.md +++ b/docs/features/webhooks.md @@ -19,6 +19,7 @@ Below is a sample configuration: ``` And an example of a response that you will get back: + ```json { "task": @@ -77,4 +78,3 @@ And an example of a response that you will get back: }, "oldState":{}} ``` - http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/operations/backup-restore.md ---------------------------------------------------------------------- diff --git a/docs/operations/backup-restore.md b/docs/operations/backup-restore.md index da467c3..15e6dd2 100644 --- a/docs/operations/backup-restore.md +++ b/docs/operations/backup-restore.md @@ -3,7 +3,7 @@ **Be sure to read the entire page before attempting to restore from a backup, as it may have unintended consequences.** -# Summary +## Summary The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an earlier, backed up, version and requires all schedulers to be taken down temporarily while @@ -18,7 +18,7 @@ so any tasks that have been rescheduled since the backup was taken will be kille Instructions below have been verified in 
[Vagrant environment](../getting-started/vagrant.md) and with minor syntax/path changes should be applicable to any Aurora cluster. -# Preparation +## Preparation Follow these steps to prepare the cluster for restoring from a backup: @@ -54,7 +54,7 @@ accomplished by updating the following scheduler configuration options: * Restart all schedulers -# Cleanup and re-initialize Mesos replicated log +## Cleanup and re-initialize Mesos replicated log Get rid of the corrupted files and re-initialize Mesos replicated log: @@ -63,7 +63,7 @@ Get rid of the corrupted files and re-initialize Mesos replicated log: * Initialize Mesos replica's log file: `sudo mesos-log initialize --path=<-native_log_file_path>` * Start schedulers -# Restore from backup +## Restore from backup At this point the scheduler is ready to rehydrate from the backup: @@ -87,5 +87,5 @@ See `aurora_admin help <command>` for usage details. the provided backup snapshot and initiate a mandatory failover `aurora_admin scheduler_commit_recovery --bypass-leader-redirect <cluster>` -# Cleanup +## Cleanup Undo any modification done during [Preparation](#preparation) sequence. http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/operations/configuration.md ---------------------------------------------------------------------- diff --git a/docs/operations/configuration.md b/docs/operations/configuration.md index 203f3be..f0581ea 100644 --- a/docs/operations/configuration.md +++ b/docs/operations/configuration.md @@ -29,7 +29,6 @@ Like Mesos, Aurora uses command-line flags for runtime configuration. As such th # Environment variables controlling libmesos export JAVA_HOME=... export GLOG_v=1 - # Port and public ip used to communicate with the Mesos master and for the replicated log export LIBPROCESS_PORT=8083 export LIBPROCESS_IP=192.168.33.7 @@ -38,6 +37,36 @@ Like Mesos, Aurora uses command-line flags for runtime configuration. 
As such th That way Aurora's current flags are visible in `ps` and in the `/vars` admin endpoint. +## JVM Configuration + +JVM settings are dependent on your environment and cluster size. They might require +custom tuning. As a starting point, we recommend: + +* Ensure the initial (`-Xms`) and maximum (`-Xmx`) heap sizes are identical to prevent heap resizing + at runtime. +* Either `-XX:+UseConcMarkSweepGC` or `-XX:+UseG1GC -XX:+UseStringDeduplication` are + sane defaults for the garbage collector. +* `-Djava.net.preferIPv4Stack=true` makes sense in most cases as well. + + +## Network Configuration + +By default, Aurora binds to all interfaces and auto-discovers its hostname. To reduce ambiguity +it helps to hardcode them though: + + -http_port=8081 + -ip=192.168.33.7 + -hostname="aurora1.us-east1.example.org" + +Two environment variables control the ip and port for the communication with the Mesos master +and for the replicated log used by Aurora: + + export LIBPROCESS_PORT=8083 + export LIBPROCESS_IP=192.168.33.7 + +It is important that those can be reached from all Mesos master and Aurora scheduler instances. + + ## Replicated Log Configuration Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is @@ -64,8 +93,13 @@ should be set to `3`. *Incorrectly setting this flag will cause data corruption to occur!* ### `-native_log_file_path` -Location of the Mesos replicated log files. Consider allocating a dedicated disk (preferably SSD) -for Mesos replicated log files to ensure optimal storage performance. +Location of the Mesos replicated log files. For optimal and consistent performance, consider +allocating a dedicated disk (preferably SSD) for the replicated log. Ensure that this disk is not +used by anything else (e.g. no process logging) and in particular that it is a real disk +and not just a partition. 
+ +Even when a dedicated disk is used, switching from the `CFQ` to the `deadline` I/O scheduler of the Linux kernel +can further help with storage performance in Aurora ([see this ticket for details](https://issues.apache.org/jira/browse/AURORA-1211)). ### `-native_log_zk_group_path` ZooKeeper path used for Mesos replicated log quorum discovery. @@ -91,8 +125,10 @@ or truncating of the replicated log used by Aurora. In that case, see the docume Configuration options for the Aurora scheduler backup manager. -* `-backup_interval`: The interval on which the scheduler writes local storage backups. The default is every hour. -* `-backup_dir`: Directory to write backups to. +* `-backup_interval`: The interval on which the scheduler writes local storage backups. + The default is every hour. +* `-backup_dir`: Directory to write backups to. As stated above, this should not be co-located on the + same disk as the replicated log. * `-max_saved_backups`: Maximum number of backups to retain before deleting the oldest backup(s). @@ -137,6 +173,23 @@ tier configuration, you will also have to specify a file path: -tier_config=path/to/tiers/config.json +## Multi-Framework Setup + +Aurora holds onto Mesos offers in order to provide efficient scheduling and +[preemption](../features/multitenancy.md#preemption). This is problematic in multi-framework +environments as Aurora might starve other frameworks. + +At the cost of increased scheduling latency, Aurora can be configured to be more cooperative: + +* Lowering `-min_offer_hold_time` (e.g. to `1mins`) can ensure unused offers are returned to + Mesos more frequently. +* Increasing `-offer_filter_duration` (e.g. to `30secs`) will instruct Mesos + not to re-offer rejected resources for the given duration. + +Setting a [minimum amount of resources](http://mesos.apache.org/documentation/latest/quota/) for +each Mesos role can further help to ensure no framework is starved entirely. 
+ + ## Containers Both the Mesos and Docker containerizers require configuration of the Mesos agent. http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/operations/installation.md ---------------------------------------------------------------------- diff --git a/docs/operations/installation.md b/docs/operations/installation.md index f9b04d4..82f5d18 100644 --- a/docs/operations/installation.md +++ b/docs/operations/installation.md @@ -26,6 +26,8 @@ profiles: A small number of machines (typically 3 or 5) responsible for cluster orchestration. In most cases it is fine to co-locate these components in anything but very large clusters (> 1000 machines). Beyond that point, operators will likely want to manage these services on separate machines. +In particular, you will want to use separate ZooKeeper ensembles for leader election and +service discovery. Otherwise a service discovery error or outage can take down the entire cluster. In practice, 5 coordinators have been shown to reliably manage clusters with tens of thousands of machines. @@ -140,7 +142,7 @@ CentOS: `sudo systemctl start aurora` wget -c https://apache.bintray.com/aurora/centos-7/aurora-executor-0.17.0-1.el7.centos.aurora.x86_64.rpm sudo yum install -y aurora-executor-0.17.0-1.el7.centos.aurora.x86_64.rpm -### Configuration +### Worker Configuration The executor typically does not require configuration. Command line arguments can be passed to the executor using a command line argument on the scheduler. @@ -194,6 +196,7 @@ Make an edit to add the `--mesos-root` flag resulting in something like: --log_to_stderr=google:INFO ) + ## Installing the client ### Ubuntu Trusty @@ -214,7 +217,7 @@ Make an edit to add the `--mesos-root` flag resulting in something like: brew upgrade brew install aurora-cli -### Configuration +### Client Configuration Client configuration lives in a json file that describes the clusters available and how to reach them. 
By default this file is at `/etc/aurora/clusters.json`. @@ -247,66 +250,7 @@ are identical for both. sudo yum -y install mesos-1.1.0 - ## Troubleshooting -So you've started your first cluster and are running into some issues? We've collected some common -stumbling blocks and solutions here to help get you moving. - -### Replicated log not initialized - -#### Symptoms -- Scheduler RPCs and web interface claim `Storage is not READY` -- Scheduler log repeatedly prints messages like - - ``` - I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status - received a broadcasted recover request - I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response - from a replica in EMPTY status - ``` - -#### Solution -When you create a new cluster, you need to inform a quorum of schedulers that they are safe to -consider their database to be empty by [initializing](#finalizing) the -replicated log. This is done to prevent the scheduler from modifying the cluster state in the event -of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path. - - -### Scheduler not registered - -#### Symptoms -Scheduler log contains - - Framework has not been registered within the tolerated delay. - -#### Solution -Double-check that the scheduler is configured correctly to reach the Mesos master. If you are registering -the master in ZooKeeper, make sure command line argument to the master: - --zk=zk://$ZK_HOST:2181/mesos/master - -is the same as the one on the scheduler: - - -mesos_master_address=zk://$ZK_HOST:2181/mesos/master - - -### Scheduler not running - -### Symptom -The scheduler process commits suicide regularly. This happens under error conditions, but -also on purpose in regular intervals. - -## Solution -Aurora is meant to be run under supervision. 
You have to configure a supervisor like -[Monit](http://mmonit.com/monit/) or [supervisord](http://supervisord.org/) to run the scheduler -and restart it whenever it fails or exists on purpose. - -Aurora supports an active health checking protocol on its admin HTTP interface - if a `GET /health` -times out or returns anything other than `200 OK` the scheduler process is unhealthy and should be -restarted. - -For example, monit can be configured with - - if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart - -assuming you set `-http_port=8081`. +So you've started your first cluster and are running into some issues? We've collected some common +stumbling blocks and solutions in our [Troubleshooting guide](troubleshooting.md) to help get you moving. http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/operations/storage.md ---------------------------------------------------------------------- diff --git a/docs/operations/storage.md b/docs/operations/storage.md index c30922f..8db6f6f 100644 --- a/docs/operations/storage.md +++ b/docs/operations/storage.md @@ -1,8 +1,6 @@ # Aurora Scheduler Storage - [Overview](#overview) -- [Replicated Log Configuration](#replicated-log-configuration) -- [Backup Configuration](#replicated-log-configuration) - [Storage Semantics](#storage-semantics) - [Reads, writes, modifications](#reads-writes-modifications) - [Read lifecycle](#read-lifecycle) @@ -21,8 +19,9 @@ For example: * Production resource quotas * Mesos resource offer host attributes -Aurora solves its persistence needs by leveraging the Mesos implementation of a Paxos replicated -log [[1]](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf) +Aurora solves its persistence needs by leveraging the +[Mesos implementation of a Paxos replicated log](http://mesos.apache.org/documentation/latest/replicated-log-internals/) +[[1]](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf) 
[[2]](http://en.wikipedia.org/wiki/State_machine_replication) with a key-value [LevelDB](https://github.com/google/leveldb) storage as persistence media. http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/operations/troubleshooting.md ---------------------------------------------------------------------- diff --git a/docs/operations/troubleshooting.md b/docs/operations/troubleshooting.md new file mode 100644 index 0000000..3a6d23b --- /dev/null +++ b/docs/operations/troubleshooting.md @@ -0,0 +1,106 @@ +# Troubleshooting + +So you've started your first cluster and are running into some issues? We've collected some common +stumbling blocks and solutions here to help get you moving. + +## Replicated log not initialized + +### Symptoms +- Scheduler RPCs and web interface claim `Storage is not READY` +- Scheduler log repeatedly prints messages like + + ``` + I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status + received a broadcasted recover request + I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response + from a replica in EMPTY status + ``` + +### Solution +When you create a new cluster, you need to inform a quorum of schedulers that they are safe to +consider their database to be empty by [initializing](installation.md#finalizing) the +replicated log. This is done to prevent the scheduler from modifying the cluster state in the event +of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path. + + +## No distinct leader elected + +### Symptoms +Either no scheduler or multiple schedulers believe they are leading. + +### Solution +Verify the [network configuration](configuration.md#network-configuration) of the Aurora +scheduler is correct: + +* The `LIBPROCESS_IP:LIBPROCESS_PORT` endpoints must be reachable from all coordinator nodes running + a scheduler or a Mesos master. 
+* Hostname lookups have to resolve to public ips rather than local ones that cannot be reached + from another node. + +In addition, double-check the [quorum settings](configuration.md#replicated-log-configuration) of the +replicated log. + + +## Scheduler not registered + +### Symptoms +Scheduler log contains + + Framework has not been registered within the tolerated delay. + +### Solution +Double-check that the scheduler is configured correctly to reach the Mesos master. If you are registering +the master in ZooKeeper, make sure the command line argument to the master: + + --zk=zk://$ZK_HOST:2181/mesos/master + +is the same as the one on the scheduler: + + -mesos_master_address=zk://$ZK_HOST:2181/mesos/master + + +## Scheduler not running + +### Symptoms +The scheduler process commits suicide regularly. This happens under error conditions, but +also on purpose in regular intervals. + +### Solution +Aurora is meant to be run under supervision. You have to configure a supervisor like +[Monit](http://mmonit.com/monit/), [supervisord](http://supervisord.org/), or systemd to run the +scheduler and restart it whenever it fails or exits on purpose. + +Aurora supports an active health checking protocol on its admin HTTP interface - if a `GET /health` +times out or returns anything other than `200 OK` the scheduler process is unhealthy and should be +restarted. + +For example, monit can be configured with + + if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart + +assuming you set `-http_port=8081`. + + +## Executor crashing or hanging + +### Symptoms +Launched task instances never transition to `STARTING` or `RUNNING` but immediately transition +to `FAILED` or `LOST`. + +### Solution +The executor might be failing due to unknown internal errors such as a missing native dependency +of the Mesos executor library. Open the Mesos UI and navigate to the failing +task in question. 
Inspect the various log files in order to learn about what is going on. + + +## Observer does not discover tasks + +### Symptoms +The observer UI does not list any tasks. When navigating from the scheduler UI to the state of +a particular task instance the observer returns `Error: 404 Not Found`. + +### Solution +The observer is refreshing its internal state every couple of seconds. If waiting a few seconds +does not resolve the issue, check that the `--mesos-root` setting of the observer and the +`--work_dir` option of the Mesos agent are in sync. For details, see our +[Install instructions](installation.md#worker-configuration). http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/operations/upgrades.md ---------------------------------------------------------------------- diff --git a/docs/operations/upgrades.md b/docs/operations/upgrades.md new file mode 100644 index 0000000..1d6a73d --- /dev/null +++ b/docs/operations/upgrades.md @@ -0,0 +1,41 @@ +# Upgrading Aurora + +Aurora can be updated from one version to the next without any downtime or restarts of running +jobs. The same holds true for Mesos. + +Generally speaking, Mesos and Aurora strive for a +1/-1 version compatibility, i.e. all components +are meant to be forward and backwards compatible for at least one version. This implies it +does not really matter in which order updates are carried out. + +Exceptions to this rule are documented in the [Aurora release-notes](../../RELEASE-NOTES.md) +and the [Mesos upgrade instructions](https://mesos.apache.org/documentation/latest/upgrades/). + + +## Instructions + +To upgrade Aurora, follow these steps: + +1. Update the first scheduler instance by updating its software and restarting its process. +2. Wait until the scheduler is up and its [Replicated Log](configuration.md#replicated-log-configuration) + caught up with the other schedulers in the cluster. The log has caught up if `log/recovered` has + the value `1`. 
You can check the metric via `curl LIBPROCESS_IP:LIBPROCESS_PORT/metrics/snapshot`, + where ip and port refer to the [libmesos configuration](configuration.md#network-configuration) + settings of the scheduler instance. +3. Proceed with the next scheduler until all instances are updated. +4. Update the Aurora executor deployed to the compute nodes of your cluster. Jobs will continue + running with the old version of the executor, and will only be launched by the new one once + they are restarted eventually due to natural cluster churn. +5. Distribute the new Aurora client to your users. + + +## Best Practices + +Even though not absolutely mandatory, we advise adhering to the following rules: + +* Never skip any major or minor releases when updating. If you have to catch up several releases you + have to deploy all intermediary versions. Skipping bugfix releases is acceptable though. +* Verify all updates on a test cluster before touching your production deployments. +* To minimize the number of failovers during updates, update the currently leading scheduler + instance last. +* Update the Aurora executor on a subset of compute nodes as a canary before deploying the change to + the whole fleet. http://git-wip-us.apache.org/repos/asf/aurora/blob/c85bffdd/docs/reference/scheduler-endpoints.md ---------------------------------------------------------------------- diff --git a/docs/reference/scheduler-endpoints.md b/docs/reference/scheduler-endpoints.md index d302e90..ddae76b 100644 --- a/docs/reference/scheduler-endpoints.md +++ b/docs/reference/scheduler-endpoints.md @@ -1,7 +1,7 @@ # HTTP endpoints There are a number of HTTP endpoints that the Aurora scheduler exposes. These allow various -operational tasks to be performed on the scheduler. Below is the list of all such endpoints +operational tasks to be performed on the scheduler. Below is an (incomplete) list of such endpoints and a brief explanation of what they do. 
## Leader health @@ -12,8 +12,8 @@ HAProxy or AWS ELB. When a HTTP GET request is issued on this endpoint, it responds as follows: - If the instance that received the GET request is the leading scheduler, a HTTP status code of - 200 (OK) is returned. + `200 OK` is returned. - If the instance that received the GET request is not the leading scheduler but a leader does - exist, a HTTP status code of 503 (SERVICE_UNAVAILABLE) is returned. -- If no leader currently exists or the leader is unknown, a HTTP status code of 502 - (BAD_GATEWAY) is returned. \ No newline at end of file + exist, a HTTP status code of `503 SERVICE_UNAVAILABLE` is returned. +- If no leader currently exists or the leader is unknown, a HTTP status code of `502 BAD_GATEWAY` + is returned.
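
Note (not part of the commit): the leader health responses documented in the last hunk could be interpreted by a monitoring script roughly as follows. This is a minimal sketch; `SchedulerRole` and `classify_leader_health` are hypothetical names chosen for illustration and are not part of Aurora. Only the mapping of the three status codes is taken from the text above, and how the status code is obtained (endpoint URL, HTTP client) is left to the caller.

```python
from enum import Enum


class SchedulerRole(Enum):
    """Hypothetical labels for the states described in the endpoint docs."""
    LEADER = "leader"          # 200 OK
    NOT_LEADER = "not_leader"  # 503 SERVICE_UNAVAILABLE
    NO_LEADER = "no_leader"    # 502 BAD_GATEWAY
    UNKNOWN = "unknown"        # anything else (not covered by the docs)


def classify_leader_health(status_code: int) -> SchedulerRole:
    """Map an HTTP status code from the leader health endpoint to a role."""
    if status_code == 200:
        return SchedulerRole.LEADER
    if status_code == 503:
        return SchedulerRole.NOT_LEADER
    if status_code == 502:
        return SchedulerRole.NO_LEADER
    # Timeouts or unexpected codes are outside the documented contract.
    return SchedulerRole.UNKNOWN
```

A load balancer check would typically only route traffic to the instance classified as `LEADER`; treating `UNKNOWN` the same as `NO_LEADER` is a conservative choice, not something the documentation prescribes.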
