Repository: aurora
Updated Branches:
  refs/heads/master 140d74d65 -> 7c7dcb265


Replace incorrect/misleading use of constraints with best practices doc.

Reviewed at https://reviews.apache.org/r/38302/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/7c7dcb26
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/7c7dcb26
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/7c7dcb26

Branch: refs/heads/master
Commit: 7c7dcb26593baf9e5941de40e34d4aa4fe1ab95c
Parents: 140d74d
Author: Bill Farner <wfar...@apache.org>
Authored: Fri Sep 11 11:00:30 2015 -0700
Committer: Bill Farner <wfar...@apache.org>
Committed: Fri Sep 11 11:00:44 2015 -0700

----------------------------------------------------------------------
 docs/deploying-aurora-scheduler.md              | 33 ++++++++++----------
 examples/vagrant/upstart/mesos-slave.conf       |  1 -
 .../apache/aurora/e2e/http/http_example.aurora  |  4 ---
 .../aurora/e2e/http/http_example_updated.aurora |  9 ++----
 4 files changed, 18 insertions(+), 29 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/7c7dcb26/docs/deploying-aurora-scheduler.md
----------------------------------------------------------------------
diff --git a/docs/deploying-aurora-scheduler.md b/docs/deploying-aurora-scheduler.md
index 73f7b19..8db0e61 100644
--- a/docs/deploying-aurora-scheduler.md
+++ b/docs/deploying-aurora-scheduler.md
@@ -21,6 +21,8 @@ machines. This guide helps you get the scheduler set up and troubleshoot some c
     - [Dedicated attribute](#dedicated-attribute)
       - [Syntax](#syntax)
       - [Example](#example)
+- [Best practices](#best-practices)
+  - [Diversity](#diversity)
 - [Common problems](#common-problems)
   - [Replicated log not initialized](#replicated-log-not-initialized)
     - [Symptoms](#symptoms)
@@ -28,9 +30,6 @@ machines. This guide helps you get the scheduler set up and troubleshoot some c
   - [Scheduler not registered](#scheduler-not-registered)
     - [Symptoms](#symptoms-1)
     - [Solution](#solution-1)
-  - [Tasks are stuck in PENDING forever](#tasks-are-stuck-in-pending-forever)
-    - [Symptoms](#symptoms-2)
-    - [Solution](#solution-2)
 - [Changing Scheduler Quorum Size](#changing-scheduler-quorum-size)
   - [Preparation](#preparation)
   - [Adding New Schedulers](#adding-new-schedulers)
@@ -220,7 +219,7 @@ enforce this.
 ##### Example
 Consider the following slave command line:
 
-    mesos-slave --attributes="host:$HOST;rack:$RACK;dedicated:db_team/redis" ...
+    mesos-slave --attributes="dedicated:db_team/redis" ...
 
 And this job configuration:
 
@@ -237,6 +236,19 @@ The job configuration is indicating that it should only be scheduled on slaves w
 `dedicated:db_team/redis`. Additionally, Aurora will prevent any tasks that do _not_ have that
 constraint from running on those slaves.
 
+## Best practices
+### Diversity
+Data centers are often organized with hierarchical failure domains. Common failure domains
+include hosts, racks, rows, and PDUs. If you have this information available, it is wise to tag
+the mesos-slave with them as
+[attributes](https://mesos.apache.org/documentation/attributes-resources/).
+
+When it comes time to schedule jobs, Aurora will automatically spread them across the failure
+domains as specified in the
+[job configuration](configuration-reference.md#specifying-scheduling-constraints).
+
+Note: in virtualized environments like EC2, the only attribute that usually makes sense for this
+purpose is `host`.
 
 ## Common problems
 So you've started your first cluster and are running into some issues? We've collected some common
@@ -278,19 +290,6 @@ is the same as the one on the scheduler:
 
     -mesos_master_address=zk://$ZK_HOST:2181/mesos/master
 
-### Tasks are stuck in `PENDING` forever
-
-#### Symptoms
-The scheduler is registered, and [receiving offers](monitoring.md#scheduler_resource_offers),
-but tasks are perpetually shown as `PENDING - Constraint not satisfied: host`.
-
-#### Solution
-Check that your slaves are configured with `host` and `rack` attributes. Aurora requires that
-slaves are tagged with these two common failure domains to ensure that it can safely place tasks
-such that jobs are resilient to failure.
-
-See our [vagrant example](examples/vagrant/upstart/mesos-slave.conf) for details.
-
 ## Changing Scheduler Quorum Size
 Special care needs to be taken when changing the size of the Aurora scheduler quorum.
 Since Aurora uses a Mesos replicated log, similar steps need to be followed as when


http://git-wip-us.apache.org/repos/asf/aurora/blob/7c7dcb26/examples/vagrant/upstart/mesos-slave.conf
----------------------------------------------------------------------
diff --git a/examples/vagrant/upstart/mesos-slave.conf b/examples/vagrant/upstart/mesos-slave.conf
index 9af680e..1ef059b 100644
--- a/examples/vagrant/upstart/mesos-slave.conf
+++ b/examples/vagrant/upstart/mesos-slave.conf
@@ -26,7 +26,6 @@ env ZK_HOST=192.168.33.7
 exec /usr/sbin/mesos-slave --master=zk://$ZK_HOST:2181/mesos/master \
   --ip=$MY_HOST \
   --hostname=$MY_HOST \
-  --attributes="host:$MY_HOST;rack:a" \
   --resources="cpus:4;mem:1024;disk:20000" \
   --work_dir="/var/lib/mesos" \
   --containerizers=docker,mesos \


http://git-wip-us.apache.org/repos/asf/aurora/blob/7c7dcb26/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
----------------------------------------------------------------------
diff --git a/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora b/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
index d7bf108..dc55109 100644
--- a/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
+++ b/src/test/sh/org/apache/aurora/e2e/http/http_example.aurora
@@ -43,10 +43,6 @@ job = Service(
   role = getpass.getuser(),
   environment = 'test',
   contact = '{{role}}@localhost',
-  # Since there is only one slave in devcluster allow all instances to run there.
-  constraints = {
-    'host': 'limit:2',
-  },
   announce = Announcer(),
 )
 


http://git-wip-us.apache.org/repos/asf/aurora/blob/7c7dcb26/src/test/sh/org/apache/aurora/e2e/http/http_example_updated.aurora
----------------------------------------------------------------------
diff --git a/src/test/sh/org/apache/aurora/e2e/http/http_example_updated.aurora b/src/test/sh/org/apache/aurora/e2e/http/http_example_updated.aurora
index c973966..f098de9 100644
--- a/src/test/sh/org/apache/aurora/e2e/http/http_example_updated.aurora
+++ b/src/test/sh/org/apache/aurora/e2e/http/http_example_updated.aurora
@@ -25,11 +25,10 @@ stage_server = Process(
   cmdline = '{{cmd}}'
 )
 
-test_task = Task(
+test_task = SequentialTask(
   name = 'http_example',
   resources = Resources(cpu=0.5, ram=34*MB, disk=64*MB),
-  processes = [stage_server, run_server],
-  constraints = order(stage_server, run_server))
+  processes = [stage_server, run_server])
 
 update_config = UpdateConfig(watch_secs=10, batch_size=3)
 health_check_config = HealthCheckConfig(initial_interval_secs=5, interval_secs=1)
@@ -43,10 +42,6 @@ job = Service(
   role = getpass.getuser(),
   environment = 'test',
   contact = '{{role}}@localhost',
-  # Since there is only one slave in devcluster allow all instances to run there.
-  constraints = {
-    'host': 'limit:4',
-  },
   announce = Announcer(),
 )
 
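
For readers of this commit, the effect of a `limit:N` diversity constraint on tagged slaves can be pictured with a small stand-alone sketch. This is only an illustration under stated assumptions: `format_attributes` and `place_with_limit` are hypothetical helper names, the greedy loop is not Aurora's actual scheduling algorithm, and the hosts and racks are made up. It shows the `key:value;key:value` string passed to `--attributes` and how a per-value limit caps the number of instances in one failure domain.

```python
from collections import Counter


def format_attributes(attrs):
    """Render a dict as the `key:value;key:value` string used with --attributes."""
    return ";".join(f"{key}:{value}" for key, value in attrs.items())


def place_with_limit(instances, hosts, attribute, limit):
    """Greedy placement honouring a 'limit:N' style constraint (hypothetical helper).

    `hosts` maps hostname -> attribute dict. At most `limit` instances may
    share the same value of `attribute`. Returns the chosen hostname for each
    instance that could be placed; instances with no eligible host are omitted.
    """
    used = Counter()  # instances placed so far, keyed by attribute value
    placement = []
    for _ in range(instances):
        for name, attrs in hosts.items():
            value = attrs[attribute]
            if used[value] < limit:
                used[value] += 1
                placement.append(name)
                break
    return placement


# Made-up three-host cluster spanning two racks.
hosts = {
    "h1": {"host": "h1", "rack": "a"},
    "h2": {"host": "h2", "rack": "a"},
    "h3": {"host": "h3", "rack": "b"},
}

print(format_attributes(hosts["h1"]))
print(place_with_limit(4, hosts, "host", 1))
print(place_with_limit(4, hosts, "rack", 2))
```

With a limit of 1 per `host`, only three of the four instances find a slot in this toy cluster. That mirrors why the e2e examples, which run on a single-slave devcluster, previously needed a raised `host` limit.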