Added: aurora/site/source/documentation/0.12.0/configuration-tutorial.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/configuration-tutorial.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/configuration-tutorial.md (added) +++ aurora/site/source/documentation/0.12.0/configuration-tutorial.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,954 @@ +Aurora Configuration Tutorial +============================= + +How to write Aurora configuration files, including feature descriptions +and best practices. When writing a configuration file, make use of +`aurora job inspect`. It takes the same job key and configuration file +arguments as `aurora job create` or `aurora update start`. It first ensures the +configuration parses, then outputs it in human-readable form. + +You should read this after going through the general [Aurora Tutorial](/documentation/0.12.0/tutorial/). + +- [Aurora Configuration Tutorial](#aurora-configuration-tutorial) + - [The Basics](#the-basics) + - [Use Bottom-To-Top Object Ordering](#use-bottom-to-top-object-ordering) + - [An Example Configuration File](#an-example-configuration-file) + - [Defining Process Objects](#defining-process-objects) + - [Getting Your Code Into The Sandbox](#getting-your-code-into-the-sandbox) + - [Defining Task Objects](#defining-task-objects) + - [SequentialTask: Running Processes in Parallel or Sequentially](#sequentialtask-running-processes-in-parallel-or-sequentially) + - [SimpleTask](#simpletask) + - [Combining tasks](#combining-tasks) + - [Defining Job Objects](#defining-job-objects) + - [The jobs List](#the-jobs-list) + - [Templating](#templating) + - [Templating 1: Binding in Pystachio](#templating-1-binding-in-pystachio) + - [Structurals in Pystachio / Aurora](#structurals-in-pystachio--aurora) + - [Mustaches Within Structurals](#mustaches-within-structurals) + - [Templating 2: Structurals Are 
Factories](#templating-2-structurals-are-factories) + - [A Second Way of Templating](#a-second-way-of-templating) + - [Advanced Binding](#advanced-binding) + - [Bind Syntax](#bind-syntax) + - [Binding Complex Objects](#binding-complex-objects) + - [Lists](#lists) + - [Maps](#maps) + - [Structurals](#structurals) + - [Structural Binding](#structural-binding) + - [Configuration File Writing Tips And Best Practices](#configuration-file-writing-tips-and-best-practices) + - [Use As Few .aurora Files As Possible](#use-as-few-aurora-files-as-possible) + - [Avoid Boilerplate](#avoid-boilerplate) + - [Thermos Uses bash, But Thermos Is Not bash](#thermos-uses-bash-but-thermos-is-not-bash) + - [Bad](#bad) + - [Good](#good) + - [Rarely Use Functions In Your Configurations](#rarely-use-functions-in-your-configurations) + - [Bad](#bad-1) + - [Good](#good-1) + +The Basics +---------- + +To run a job on Aurora, you must specify a configuration file that tells +Aurora what it needs to know to schedule the job, what Mesos needs to +run the tasks the job is made up of, and what Thermos needs to run the +processes that make up the tasks. This file must have +a `.aurora` suffix. + +A configuration file defines a collection of objects, along with parameter +values for their attributes. An Aurora configuration file contains the +following three types of objects: + +- Job +- Task +- Process + +A configuration also specifies a list of `Job` objects assigned +to the variable `jobs`. + +- jobs (list of defined Jobs to run) + +The `.aurora` file format is just Python. However, `Job`, `Task`, +`Process`, and other classes are defined by a type-checked dictionary +templating library called *Pystachio*, a powerful tool for +configuration specification and reuse. Pystachio objects are tailored +via templates surrounded by {{}}. 
+ +When writing your `.aurora` file, you may use any Pystachio datatypes, as +well as any objects shown in the [*Aurora+Thermos Configuration +Reference*](/documentation/0.12.0/configuration-reference/), without `import` statements - the +Aurora config loader injects them automatically. Other than that, an `.aurora` +file works like any other Python script. + +[*Aurora+Thermos Configuration Reference*](/documentation/0.12.0/configuration-reference/) +has a full reference of all Aurora/Thermos defined Pystachio objects. + +### Use Bottom-To-Top Object Ordering + +A well-structured configuration starts with structural templates (if +any). Structural templates encapsulate in their attributes all the +differences between Jobs in the configuration that are not directly +manipulated at the `Job` level, but typically at the `Process` or `Task` +level. For example, if certain processes are invoked with slightly +different settings or input. + +After structural templates, define, in order, `Process`es, `Task`s, and +`Job`s. + +Structural template names should be *UpperCamelCased* and their +instantiations are typically *UPPER\_SNAKE\_CASED*. `Process`, `Task`, +and `Job` names are typically *lower\_snake\_cased*. Indentation is typically 2 +spaces. + +An Example Configuration File +----------------------------- + +The following is a typical configuration file. Don't worry if there are +parts you don't understand yet, but you may want to refer back to this +as you read about its individual parts. Note that names surrounded by +curly braces {{}} are template variables, which the system replaces with +bound values for the variables. 
+ + # --- templates here --- + class Profile(Struct): + package_version = Default(String, 'live') + java_binary = Default(String, '/usr/lib/jvm/java-1.7.0-openjdk/bin/java') + extra_jvm_options = Default(String, '') + parent_environment = Default(String, 'prod') + parent_serverset = Default(String, + '/foocorp/service/bird/{{parent_environment}}/bird') + + # --- processes here --- + main = Process( + name = 'application', + cmdline = '{{profile.java_binary}} -server -Xmx1792m ' + '{{profile.extra_jvm_options}} ' + '-jar application.jar ' + '-upstreamService {{profile.parent_serverset}}' + ) + + # --- tasks --- + base_task = SequentialTask( + name = 'application', + processes = [ + Process( + name = 'fetch', + cmdline = 'curl -O ' + 'https://packages.foocorp.com/{{profile.package_version}}/application.jar'), + ] + ) + + # not always necessary but often useful to have separate task + # resource classes + staging_task = base_task(resources = + Resources(cpu = 1.0, + ram = 2048*MB, + disk = 1*GB)) + production_task = base_task(resources = + Resources(cpu = 4.0, + ram = 2560*MB, + disk = 10*GB)) + + # --- job template --- + job_template = Job( + name = 'application', + role = 'myteam', + contact = '[email protected]', + instances = 20, + service = True, + task = production_task + ) + + # -- profile instantiations (if any) --- + PRODUCTION = Profile() + STAGING = Profile( + extra_jvm_options = '-Xloggc:gc.log', + parent_environment = 'staging' + ) + + # -- job instantiations -- + jobs = [ + job_template(cluster = 'cluster1', environment = 'prod') + .bind(profile = PRODUCTION), + + job_template(cluster = 'cluster2', environment = 'prod') + .bind(profile = PRODUCTION), + + job_template(cluster = 'cluster1', + environment = 'staging', + service = False, + task = staging_task, + instances = 2) + .bind(profile = STAGING), + ] + +## Defining Process Objects + +Processes are handled by the Thermos system. 
A process is a single +executable step run as a part of an Aurora task, which consists of a +bash-executable statement. + +The key (and required) `Process` attributes are: + +- `name`: Any string which is a valid Unix filename (no slashes, + NULLs, or leading periods). The `name` value must be unique relative + to other Processes in a `Task`. +- `cmdline`: A command line run in a bash subshell, so you can use + bash scripts. Nothing is supplied for command-line arguments, + so `$*` is unspecified. + +Many tiny processes make managing configurations more difficult. For +example, the following is a bad way to define processes. + + copy = Process( + name = 'copy', + cmdline = 'curl -O https://packages.foocorp.com/app.zip' + ) + unpack = Process( + name = 'unpack', + cmdline = 'unzip app.zip' + ) + remove = Process( + name = 'remove', + cmdline = 'rm -f app.zip' + ) + run = Process( + name = 'app', + cmdline = 'java -jar app.jar' + ) + run_task = Task( + processes = [copy, unpack, remove, run], + constraints = order(copy, unpack, remove, run) + ) + +Since `cmdline` runs in a bash subshell, you can chain commands +with `&&` or `||`. + +When defining a `Task` that is just a list of Processes run in a +particular order, use `SequentialTask`, as described in the [*Defining* +`Task` *Objects*](#Task) section. The following simplifies and combines the +above multiple `Process` definitions into just two. + + stage = Process( + name = 'stage', + cmdline = 'curl -O https://packages.foocorp.com/app.zip && ' + 'unzip app.zip && rm -f app.zip') + + run = Process(name = 'app', cmdline = 'java -jar app.jar') + + run_task = SequentialTask(processes = [stage, run]) + +`Process` also has optional attributes to customize its behaviour. Details can be found in the [*Aurora+Thermos Configuration Reference*](/documentation/0.12.0/configuration-reference/#process-objects). 
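
Because an `.aurora` file is ordinary Python, a chained `cmdline` like the one in `stage` can also be assembled from a list of steps. The following is a minimal sketch in plain Python; the names `steps` and `stage_cmdline` are illustrative only and are not part of Aurora or Thermos.

```python
# Illustrative only: plain string handling, no Aurora objects involved.
steps = [
    'curl -O https://packages.foocorp.com/app.zip',
    'unzip app.zip',
    'rm -f app.zip',
]

# Joining with '&&' makes bash stop at the first step that fails.
stage_cmdline = ' && '.join(steps)
print(stage_cmdline)
```

The resulting string could then be passed as the `cmdline` of a single `Process`, which keeps the step list readable without adding more Processes.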
+ + +## Getting Your Code Into The Sandbox + +When using Aurora, you need to get your executable code into its "sandbox", specifically +the Task sandbox where the code executes for the Processes that make up that Task. + +Each Task has a sandbox created when the Task starts and garbage +collected when it finishes. All of a Task's processes run in its +sandbox, so processes can share state by using a shared current +working directory. + +Typically, you save this code somewhere. You then need to define a Process +in your `.aurora` configuration file that fetches the code from that location +to where the slave can see it. For a public cloud, that can be anywhere public on +the Internet, such as S3. For a private cloud with internal storage, you need to put it +on an accessible HDFS cluster or similar storage. + +The template for this Process is: + + <name> = Process( + name = '<name>', + cmdline = '<command to copy and extract code archive into current working directory>' + ) + +Note: Be sure the extracted code archive has an executable. + +## Defining Task Objects + +Tasks are handled by Mesos. A task is a collection of processes that +runs in a shared sandbox. It's the fundamental unit Aurora uses to +schedule the datacenter; essentially what Aurora does is find places +in the cluster to run tasks. + +The key (and required) parts of a Task are: + +- `name`: A string giving the Task's name. By default, if a Task is + not given a name, it inherits the first name in its Process list. + +- `processes`: An unordered list of Process objects bound to the Task. + The value of the optional `constraints` attribute affects the + contents as a whole. Currently, the only constraint, `order`, determines if + the processes run in parallel or sequentially. + +- `resources`: A `Resources` object defining the Task's resource + footprint. A `Resources` object has three attributes: + - `cpu`: A Float, the fractional number of cores the Task + requires. 
+ - `ram`: An Integer, RAM bytes the Task requires. + - `disk`: An integer, disk bytes the Task requires. + +A basic Task definition looks like: + + Task( + name="hello_world", + processes=[Process(name = "hello_world", cmdline = "echo hello world")], + resources=Resources(cpu = 1.0, + ram = 1*GB, + disk = 1*GB)) + +A Task has optional attributes to customize its behaviour. Details can be found in the [*Aurora+Thermos Configuration Reference*](/documentation/0.12.0/configuration-reference/#task-object) + + +### SequentialTask: Running Processes in Parallel or Sequentially + +By default, a Task with several Processes runs them in parallel. There +are two ways to run Processes sequentially: + +- Include an `order` constraint in the Task definition's `constraints` + attribute whose arguments specify the processes' run order: + + Task( ... processes=[process1, process2, process3], + constraints = order(process1, process2, process3), ...) + +- Use `SequentialTask` instead of `Task`; it automatically runs + processes in the order specified in the `processes` attribute. No + `constraint` parameter is needed: + + SequentialTask( ... processes=[process1, process2, process3] ...) + +### SimpleTask + +For quickly creating simple tasks, use the `SimpleTask` helper. It +creates a basic task from a provided name and command line using a +default set of resources. 
For example, in a `.aurora` configuration +file: + + SimpleTask(name="hello_world", command="echo hello world") + +is equivalent to + + Task(name="hello_world", + processes=[Process(name = "hello_world", cmdline = "echo hello world")], + resources=Resources(cpu = 1.0, + ram = 1*GB, + disk = 1*GB)) + +The simplest idiomatic Job configuration thus becomes: + + import os + hello_world_job = Job( + task=SimpleTask(name="hello_world", command="echo hello world"), + role=os.getenv('USER'), + cluster="cluster1") + +Once this is saved as `hello_world.aurora`, you create the job with +`aurora job create cluster1/$USER/test/hello_world hello_world.aurora`. + +### Combining tasks + +`Tasks.concat` (synonym: `concat_tasks`) and +`Tasks.combine` (synonym: `combine_tasks`) merge multiple Task definitions +into a single Task. It may be easier to define complex Jobs +as smaller constituent Tasks. But since a Job only includes a single +Task, the subtasks must be combined before using them in a Job. +Smaller Tasks can also be reused between Jobs, instead of having to +repeat their definition for multiple Jobs. + +With both methods, the merged Task takes the first Task's name. The +difference between the two is the resulting Task's process ordering. + +- `Tasks.combine` runs its subtasks' processes in no particular order. + The new Task's resource consumption is the sum of all its subtasks' + consumption. + +- `Tasks.concat` runs its subtasks in the order supplied, with each + subtask's processes run serially between tasks. It is analogous to + the `order` constraint helper, except at the Task level instead of + the Process level. The new Task's resource consumption is the + maximum value specified by any subtask for each Resource attribute + (cpu, ram and disk). + +For example, given the following: + + setup_task = Task( + ... 
+ processes=[download_interpreter, update_zookeeper], + # It is important to note that {{Tasks.concat}} has + # no effect on the ordering of the processes within a task; + # hence the necessity of the {{order}} statement below + # (otherwise, the order in which {{download_interpreter}} + # and {{update_zookeeper}} run will be non-deterministic) + constraints=order(download_interpreter, update_zookeeper), + ... + ) + + run_task = SequentialTask( + ... + processes=[download_application, start_application], + ... + ) + + combined_task = Tasks.concat(setup_task, run_task) + +The `Tasks.concat` command merges the two Tasks into a single Task and +ensures all processes in `setup_task` run before the processes +in `run_task`. Conceptually, the task is reduced to: + + task = Task( + ... + processes=[download_interpreter, update_zookeeper, + download_application, start_application], + constraints=order(download_interpreter, update_zookeeper, + download_application, start_application), + ... + ) + +In the case of `Tasks.combine`, the two schedules run in parallel: + + task = Task( + ... + processes=[download_interpreter, update_zookeeper, + download_application, start_application], + constraints=order(download_interpreter, update_zookeeper) + + order(download_application, start_application), + ... + ) + +In the latter case, each of the two sequences may operate in parallel. +Of course, this may not be the intended behavior (for example, if +the `start_application` Process implicitly relies +upon `download_interpreter`). Make sure you understand the difference +between using one or the other. + +## Defining Job Objects + +A job is a group of identical tasks that Aurora can run in a Mesos cluster. + +A `Job` object is defined by the values of several attributes, some +required and some optional. The required attributes are: + +- `task`: Task object to bind to this job. Note that a Job can + only take a single Task. 
+ +- `role`: Job's role account; in other words, the user account to run + the job as on a Mesos cluster machine. A common value is + `os.getenv('USER')`; using a Python command to get the user who + submits the job request. The other common value is the service + account that runs the job, e.g. `www-data`. + +- `environment`: Job's environment, typical values + are `devel`, `test`, or `prod`. + +- `cluster`: Aurora cluster to schedule the job in, defined in + `/etc/aurora/clusters.json` or `~/.clusters.json`. You can specify + jobs where the only difference is the `cluster`, then at run time + only run the Job whose job key includes your desired cluster's name. + +You usually see a `name` parameter. By default, `name` inherits its +value from the Job's associated Task object, but you can override this +default. For these four parameters, a Job definition might look like: + + foo_job = Job( name = 'foo', cluster = 'cluster1', + role = os.getenv('USER'), environment = 'prod', + task = foo_task) + +In addition to the required attributes, there are several optional +attributes. Details can be found in the [Aurora+Thermos Configuration Reference](/documentation/0.12.0/configuration-reference/#job-objects). + + +## The jobs List + +At the end of your `.aurora` file, you need to specify a list of the +file's defined Jobs. For example, the following exports the jobs `job1`, +`job2`, and `job3`. + + jobs = [job1, job2, job3] + +This allows the aurora client to invoke commands on those jobs, such as +starting, updating, or killing them. + +Templating +---------- + +The `.aurora` file format is just Python. However, `Job`, `Task`, +`Process`, and other classes are defined by a templating library called +*Pystachio*, a powerful tool for configuration specification and reuse. + +[Aurora+Thermos Configuration Reference](/documentation/0.12.0/configuration-reference/) +has a full reference of all Aurora/Thermos defined Pystachio objects. 
+ +When writing your `.aurora` file, you may use any Pystachio datatypes, as +well as any objects shown in the *Aurora+Thermos Configuration +Reference* without `import` statements - the Aurora config loader +injects them automatically. Other than that the `.aurora` format +works like any other Python script. + +### Templating 1: Binding in Pystachio + +Pystachio uses the visually distinctive {{}} to indicate template +variables. These are often called "mustache variables" after the +similarly appearing variables in the Mustache templating system and +because the curly braces resemble mustaches. + +If you are familiar with the Mustache system, templates in Pystachio +have significant differences. They have no nesting, joining, or +inheritance semantics. On the other hand, when evaluated, templates +are evaluated iteratively, so this affords some level of indirection. + +Let's start with the simplest template; text with one +variable, in this case `name`; + + Hello {{name}} + +If we evaluate this as is, we'd get back: + + Hello + +If a template variable doesn't have a value, when evaluated it's +replaced with nothing. If we add a binding to give it a value: + + { "name" : "Tom" } + +We'd get back: + + Hello Tom + +Every Pystachio object has an associated `.bind` method that can bind +values to {{}} variables. Bindings are not immediately evaluated. +Instead, they are evaluated only when the interpolated value of the +object is necessary, e.g. for performing equality or serializing a +message over the wire. + +Objects with and without mustache templated variables behave +differently: + + >>> Float(1.5) + Float(1.5) + + >>> Float('{{x}}.5') + Float({{x}}.5) + + >>> Float('{{x}}.5').bind(x = 1) + Float(1.5) + + >>> Float('{{x}}.5').bind(x = 1) == Float(1.5) + True + + >>> contextual_object = String('{{metavar{{number}}}}').bind( + ... 
metavar1 = "first", metavar2 = "second") + + >>> contextual_object + String({{metavar{{number}}}}) + + >>> contextual_object.bind(number = 1) + String(first) + + >>> contextual_object.bind(number = 2) + String(second) + +You usually bind simple key to value pairs, but you can also bind three +other kinds of objects: lists, dictionaries, and structurals. These will be +described in detail later. + +### Structurals in Pystachio / Aurora + +Most Aurora/Thermos users don't ever (knowingly) interact with `String`, +`Float`, or `Integer` Pystachio objects directly. Instead they interact +with derived structural (`Struct`) objects that are collections of +fundamental and structural objects. The structural object components are +called *attributes*. Aurora's most used structural objects are `Job`, +`Task`, and `Process`: + + class Process(Struct): + cmdline = Required(String) + name = Required(String) + max_failures = Default(Integer, 1) + daemon = Default(Boolean, False) + ephemeral = Default(Boolean, False) + min_duration = Default(Integer, 5) + final = Default(Boolean, False) + +Construct default objects by following the object's type with (). If you +want an attribute to have a value different from its default, include +the attribute name and value inside the parentheses. + + >>> Process() + Process(daemon=False, max_failures=1, ephemeral=False, + min_duration=5, final=False) + +Attribute values can be template variables, which receive specific +values once they are bound. + + >>> Process(cmdline = 'echo {{message}}') + Process(daemon=False, max_failures=1, ephemeral=False, min_duration=5, + cmdline=echo {{message}}, final=False) + + >>> Process(cmdline = 'echo {{message}}').bind(message = 'hello world') + Process(daemon=False, max_failures=1, ephemeral=False, min_duration=5, + cmdline=echo hello world, final=False) + +A powerful binding property is that all of an object's children inherit its +bindings: + + >>> List(Process)([ + ... 
Process(name = '{{prefix}}_one'), + ... Process(name = '{{prefix}}_two') + ... ]).bind(prefix = 'hello') + ProcessList( + Process(daemon=False, name=hello_one, max_failures=1, ephemeral=False, min_duration=5, final=False), + Process(daemon=False, name=hello_two, max_failures=1, ephemeral=False, min_duration=5, final=False) + ) + +Remember that an Aurora Job contains Tasks which contain Processes. A +Job level binding is inherited by its Tasks and all their Processes. +Similarly a Task level binding is available to that Task and its +Processes but is *not* visible at the Job level (inheritance is a +one-way street.) + +#### Mustaches Within Structurals + +When you define a `Struct` schema, one powerful, but confusing, feature +is that all of that structure's attributes are Mustache variables within +the enclosing scope *once they have been populated*. + +For example, when `Process` is defined above, all its attributes such as +{{`name`}}, {{`cmdline`}}, {{`max_failures`}} etc., are all immediately +defined as Mustache variables, implicitly bound into the `Process`, and +inherit all child objects once they are defined. + +Thus, you can do the following: + + >>> Process(name = "installer", cmdline = "echo {{name}} is running") + Process(daemon=False, name=installer, max_failures=1, ephemeral=False, min_duration=5, + cmdline=echo installer is running, final=False) + +WARNING: This binding only takes place in one direction. For example, +the following does NOT work and does not set the `Process` `name` +attribute's value. + + >>> Process().bind(name = "installer") + Process(daemon=False, max_failures=1, ephemeral=False, min_duration=5, final=False) + +The following is also not possible and results in an infinite loop that +attempts to resolve `Process.name`. + + >>> Process(name = '{{name}}').bind(name = 'installer') + +Do not confuse Structural attributes with bound Mustache variables. +Attributes are implicitly converted to Mustache variables but not vice +versa. 
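
The one-way behaviour described in this section can be illustrated with a small stand-in written in plain Python. This is only a sketch under stated assumptions: `resolve`, `MUSTACHE`, and `max_passes` are hypothetical names, the code is not Pystachio's implementation, and it assumes that an attribute shadows a binding of the same name.

```python
import re

# Hypothetical stand-in for Pystachio's interpolation, not its real code.
MUSTACHE = re.compile(r'\{\{(\w+)\}\}')

def resolve(template, scope, max_passes=10):
    """Fill in {{name}} references from scope, pass after pass."""
    text = template
    for _ in range(max_passes):
        # Stop as soon as no remaining reference can be filled from scope.
        if not any(name in scope for name in MUSTACHE.findall(text)):
            return text
        # Names missing from scope are left untouched.
        text = MUSTACHE.sub(
            lambda m: str(scope.get(m.group(1), m.group(0))), text)
    raise RuntimeError('gave up resolving %r' % template)

# Attributes act as variables: {{name}} sees the name attribute.
attributes = {'name': 'installer', 'cmdline': 'echo {{name}} is running'}
print(resolve(attributes['cmdline'], attributes))

# Assumption: an attribute shadows a binding of the same name, so
# name = '{{name}}' can only ever refer to itself.
scope = {'name': 'installer'}       # from .bind(name = 'installer')
scope.update({'name': '{{name}}'})  # the attribute wins
try:
    resolve('{{name}}', scope)
except RuntimeError as error:
    print(error)
```

The sketch caps the number of passes so that the self-referencing case ends with an error instead of looping forever; whether Pystachio applies any such cap is not stated in this tutorial.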
+ +### Templating 2: Structurals Are Factories + +#### A Second Way of Templating + +A second templating method is both as powerful as the aforementioned and +often confused with it. This method is due to automatic conversion of +Struct attributes to Mustache variables as described above. + +Suppose you create a Process object: + + >>> p = Process(name = "process_one", cmdline = "echo hello world") + + >>> p + Process(daemon=False, name=process_one, max_failures=1, ephemeral=False, min_duration=5, + cmdline=echo hello world, final=False) + +This `Process` object, "`p`", can be used wherever a `Process` object is +needed. It can also be reused by changing the value(s) of its +attribute(s). Here we change its `name` attribute from `process_one` to +`process_two`. + + >>> p(name = "process_two") + Process(daemon=False, name=process_two, max_failures=1, ephemeral=False, min_duration=5, + cmdline=echo hello world, final=False) + +Template creation is a common use for this technique: + + >>> Daemon = Process(daemon = True) + >>> logrotate = Daemon(name = 'logrotate', cmdline = './logrotate conf/logrotate.conf') + >>> mysql = Daemon(name = 'mysql', cmdline = 'bin/mysqld --safe-mode') + +### Advanced Binding + +As described above, `.bind()` binds simple strings or numbers to +Mustache variables. In addition to Structural types formed by combining +atomic types, Pystachio has two container types; `List` and `Map` which +can also be bound via `.bind()`. + +#### Bind Syntax + +The `bind()` function can take Python dictionaries or `kwargs` +interchangeably (when "`kwargs`" is in a function definition, `kwargs` +receives a Python dictionary containing all keyword arguments after the +formal parameter list). 
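
The parenthetical above describes ordinary Python keyword-argument handling. A minimal sketch in plain Python, where `bind_args` is a hypothetical helper and not Pystachio's `.bind()`, shows why the two call styles can be treated as interchangeable:

```python
def bind_args(*mappings, **kwargs):
    """Hypothetical helper: merge positional dicts and keyword arguments."""
    merged = {}
    for mapping in mappings:
        merged.update(mapping)   # bindings passed as dictionaries
    merged.update(kwargs)        # bindings passed as keyword arguments
    return merged

# Both call styles produce the same set of bindings.
print(bind_args(foo='bar') == bind_args({'foo': 'bar'}))  # prints True
```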
+ + >>> String('{{foo}}').bind(foo = 'bar') == String('{{foo}}').bind({'foo': 'bar'}) + True + +Bindings done "closer" to the object in question take precedence: + + >>> p = Process(name = '{{context}}_process') + >>> t = Task().bind(context = 'global') + >>> t(processes = [p, p.bind(context = 'local')]) + Task(processes=ProcessList( + Process(daemon=False, name=global_process, max_failures=1, ephemeral=False, final=False, + min_duration=5), + Process(daemon=False, name=local_process, max_failures=1, ephemeral=False, final=False, + min_duration=5) + )) + +#### Binding Complex Objects + +##### Lists + + >>> fibonacci = List(Integer)([1, 1, 2, 3, 5, 8, 13]) + >>> String('{{fib[4]}}').bind(fib = fibonacci) + String(5) + +##### Maps + + >>> first_names = Map(String, String)({'Kent': 'Clark', 'Wayne': 'Bruce', 'Prince': 'Diana'}) + >>> String('{{first[Kent]}}').bind(first = first_names) + String(Clark) + +##### Structurals + + >>> String('{{p.cmdline}}').bind(p = Process(cmdline = "echo hello world")) + String(echo hello world) + +### Structural Binding + +Use structural templates when binding more than two or three individual +values at the Job or Task level. For fewer than two or three, standard +key to string binding is sufficient. + +Structural binding is a very powerful pattern and is most useful in +Aurora/Thermos for doing Structural configuration. For example, you can +define a job profile. The following profile uses `HDFS`, the Hadoop +Distributed File System, to designate a file's location. `HDFS` does +not come with Aurora, so you'll need to either install it separately +or change the way the dataset is designated. 
+ + class Profile(Struct): + version = Required(String) + environment = Required(String) + dataset = Default(String, 'hdfs://home/aurora/data/{{environment}}') + + PRODUCTION = Profile(version = 'live', environment = 'prod') + DEVEL = Profile(version = 'latest', + environment = 'devel', + dataset = 'hdfs://home/aurora/data/test') + TEST = Profile(version = 'latest', environment = 'test') + + JOB_TEMPLATE = Job( + name = 'application', + role = 'myteam', + cluster = 'cluster1', + environment = '{{profile.environment}}', + task = SequentialTask( + name = 'task', + resources = Resources(cpu = 2, ram = 4*GB, disk = 8*GB), + processes = [ + Process(name = 'main', + cmdline = 'java -jar application.jar -hdfsPath {{profile.dataset}}') + ] + ) + ) + + jobs = [ + JOB_TEMPLATE(instances = 100).bind(profile = PRODUCTION), + JOB_TEMPLATE.bind(profile = DEVEL), + JOB_TEMPLATE.bind(profile = TEST), + ] + +In this case, a custom structural "Profile" is created to self-document +the configuration to some degree. This also allows some schema +"type-checking", and for default self-substitution, e.g. in +`Profile.dataset` above. + +So rather than a `.bind()` with a half-dozen substituted variables, you +can bind a single object that has sensible defaults stored in a single +place. + +Configuration File Writing Tips And Best Practices +-------------------------------------------------- + +### Use As Few .aurora Files As Possible + +When creating your `.aurora` configuration, try to keep all versions of +a particular job within the same `.aurora` file. For example, if you +have separate jobs for `cluster1`, `cluster1` staging, `cluster1` +testing, and `cluster2`, keep them as close together as possible. + +Constructs shared across multiple jobs owned by your team (e.g. +team-level defaults or structural templates) can be split into separate +`.aurora` files and included via the `include` directive. 
+ +### Avoid Boilerplate + +If you see repetition or find yourself copying and pasting any parts of +your configuration, it's likely an opportunity for templating. Take the +example below: + +`redundant.aurora` contains: + + download = Process( + name = 'download', + cmdline = 'wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2', + max_failures = 5, + min_duration = 1) + + unpack = Process( + name = 'unpack', + cmdline = 'rm -rf Python-2.7.3 && tar xjf Python-2.7.3.tar.bz2', + max_failures = 5, + min_duration = 1) + + build = Process( + name = 'build', + cmdline = 'pushd Python-2.7.3 && ./configure && make && popd', + max_failures = 1) + + email = Process( + name = 'email', + cmdline = 'echo Success | mail [email protected]', + max_failures = 5, + min_duration = 1) + + build_python = Task( + name = 'build_python', + processes = [download, unpack, build, email], + constraints = [Constraint(order = ['download', 'unpack', 'build', 'email'])]) + +As you'll notice, there's a lot of repetition in the `Process` +definitions. For example, almost every process sets a `max_failures` +limit to 5 and a `min_duration` to 1. This is an opportunity for factoring +into a common process template. + +Furthermore, the Python version is repeated everywhere. This can be +bound via structural templating as described in the [Advanced Binding](#advanced-binding) +section. 
+ +`less_redundant.aurora` contains: + + class Python(Struct): + version = Required(String) + base = Default(String, 'Python-{{version}}') + package = Default(String, '{{base}}.tar.bz2') + + ReliableProcess = Process( + max_failures = 5, + min_duration = 1) + + download = ReliableProcess( + name = 'download', + cmdline = 'wget http://www.python.org/ftp/python/{{python.version}}/{{python.package}}') + + unpack = ReliableProcess( + name = 'unpack', + cmdline = 'rm -rf {{python.base}} && tar xjf {{python.package}}') + + build = ReliableProcess( + name = 'build', + cmdline = 'pushd {{python.base}} && ./configure && make && popd', + max_failures = 1) + + email = ReliableProcess( + name = 'email', + cmdline = 'echo Success | mail {{role}}@foocorp.com') + + build_python = SequentialTask( + name = 'build_python', + processes = [download, unpack, build, email]).bind(python = Python(version = "2.7.3")) + +### Thermos Uses bash, But Thermos Is Not bash + +#### Bad + +Many tiny Processes make configurations harder to manage. + + copy = Process( + name = 'copy', + cmdline = 'rcp user@my_machine:my_application .' + ) + + unpack = Process( + name = 'unpack', + cmdline = 'unzip app.zip' + ) + + remove = Process( + name = 'remove', + cmdline = 'rm -f app.zip' + ) + + run = Process( + name = 'app', + cmdline = 'java -jar app.jar' + ) + + run_task = Task( + processes = [copy, unpack, remove, run], + constraints = order(copy, unpack, remove, run) + ) + +#### Good + +Each `cmdline` runs in a bash subshell, so you have the full power of +bash. Chaining commands with `&&` or `||` is almost always the right +thing to do. + +Also, for Tasks that are simply a list of processes that run one after +another, consider using the `SequentialTask` helper, which applies a +linear ordering constraint for you. + + stage = Process( + name = 'stage', + cmdline = 'rcp user@my_machine:my_application . 
&& unzip app.zip && rm -f app.zip') + + run = Process(name = 'app', cmdline = 'java -jar app.jar') + + run_task = SequentialTask(processes = [stage, run]) + +### Rarely Use Functions In Your Configurations + +90% of the time you define a function in a `.aurora` file, you're +probably Doing It Wrong(TM). + +#### Bad + + def get_my_task(name, user, cpu, ram, disk): + return Task( + name = name, + user = user, + processes = [STAGE_PROCESS, RUN_PROCESS], + constraints = order(STAGE_PROCESS, RUN_PROCESS), + resources = Resources(cpu = cpu, ram = ram, disk = disk) + ) + + task_one = get_my_task('task_one', 'feynman', 1.0, 32*MB, 1*GB) + task_two = get_my_task('task_two', 'feynman', 2.0, 64*MB, 1*GB) + +#### Good + +This one is more idiomatic. Forcing keyword arguments prevents accidents, +e.g. passing `32*MB` in the disk position when you meant 32MB of RAM. +Fewer task-construction techniques also make for a configuration that is +easier to read, quicker to understand, and more composable. + + TASK_TEMPLATE = SequentialTask( + user = 'wickman', + processes = [STAGE_PROCESS, RUN_PROCESS], + ) + + task_one = TASK_TEMPLATE( + name = 'task_one', + resources = Resources(cpu = 1.0, ram = 32*MB, disk = 1*GB) + ) + + task_two = TASK_TEMPLATE( + name = 'task_two', + resources = Resources(cpu = 2.0, ram = 64*MB, disk = 1*GB) + )
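The template idiom above relies on Pystachio objects acting as factories: calling one with keyword arguments yields a new object with those attributes overridden, and the template itself is left untouched. The plain-Python sketch below imitates that behaviour with a hypothetical `Template` class. It is not Pystachio and performs no type checking; the class name, `get` method, and the `cpu` attribute are assumptions made for illustration only.

```python
class Template(object):
    """Hypothetical stand-in for a Pystachio struct: a bag of named attributes."""

    def __init__(self, **attrs):
        self._attrs = dict(attrs)

    def __call__(self, **overrides):
        # Derive a new object; the template itself is never modified.
        merged = dict(self._attrs)
        merged.update(overrides)
        return Template(**merged)

    def get(self, key):
        return self._attrs[key]


# Shared settings live in one place...
TASK_TEMPLATE = Template(user='wickman', processes=['stage', 'run'])

# ...and each task states only what differs, always by keyword.
task_one = TASK_TEMPLATE(name='task_one', cpu=1.0)
task_two = TASK_TEMPLATE(name='task_two', cpu=2.0)
```

Because every override is named, swapping two values by accident (the risk with the positional `get_my_task` helper) is not possible, and `TASK_TEMPLATE` can be reused for further tasks without side effects.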
Added: aurora/site/source/documentation/0.12.0/cron-jobs.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/cron-jobs.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/cron-jobs.md (added) +++ aurora/site/source/documentation/0.12.0/cron-jobs.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,131 @@ +# Cron Jobs + +Aurora supports execution of scheduled jobs on a Mesos cluster using cron-style syntax. + +- [Overview](#overview) +- [Collision Policies](#collision-policies) + - [KILL_EXISTING](#kill_existing) + - [CANCEL_NEW](#cancel_new) +- [Failure recovery](#failure-recovery) +- [Interacting with cron jobs via the Aurora CLI](#interacting-with-cron-jobs-via-the-aurora-cli) + - [cron schedule](#cron-schedule) + - [cron deschedule](#cron-deschedule) + - [cron start](#cron-start) + - [job killall, job restart, job kill](#job-killall-job-restart-job-kill) +- [Technical Note About Syntax](#technical-note-about-syntax) +- [Caveats](#caveats) + - [Failovers](#failovers) + - [Collision policy is best-effort](#collision-policy-is-best-effort) + - [Timezone Configuration](#timezone-configuration) + +## Overview + +A job is identified as a cron job by the presence of a +`cron_schedule` attribute containing a cron-style schedule in the +[`Job`](/documentation/0.12.0/configuration-reference/#job-objects) object. Examples of cron schedules +include "every 5 minutes" (`*/5 * * * *`), "Fridays at 17:00" (`0 17 * * FRI`), and +"the 1st and 15th day of the month at 03:00" (`0 3 1,15 * *`). + +Example (available in the [Vagrant environment](/documentation/0.12.0/vagrant/)): + + $ cat /vagrant/examples/jobs/cron_hello_world.aurora + # cron_hello_world.aurora + # A cron job that runs every 5 minutes. 
+ jobs = [ + Job( + cluster = 'devcluster', + role = 'www-data', + environment = 'test', + name = 'cron_hello_world', + cron_schedule = '*/5 * * * *', + task = SimpleTask( + 'cron_hello_world', + 'echo "Hello world from cron, the time is now $(date --rfc-822)"'), + ), + ] + +## Collision Policies + +The `cron_collision_policy` field specifies the scheduler's behavior when a new cron job is +triggered while an older run hasn't finished. The scheduler has two policies available, +[KILL_EXISTING](#kill_existing) and [CANCEL_NEW](#cancel_new). + +### KILL_EXISTING + +The default policy - on a collision the old instances are killed and new instances with the current +configuration are started. + +### CANCEL_NEW + +On a collision the new run is cancelled. + +Note that the use of this policy is likely a code smell - interrupted cron jobs should be able +to recover their progress on a subsequent invocation, otherwise they risk having their work queue +grow faster than they can process it. + +## Failure recovery + +Unlike with services, which Aurora will always re-execute regardless of exit status, instances of +cron jobs retry according to the `max_task_failures` attribute of the +[Task](/documentation/0.12.0/configuration-reference/#task-objects) object. To get "run-until-success" semantics, +set `max_task_failures` to `-1`. + +## Interacting with cron jobs via the Aurora CLI + +Most interaction with cron jobs takes place using the `cron` subcommand. See `aurora cron -h` +for up-to-date usage instructions. + +### cron schedule +Schedules a new cron job on the Aurora cluster for later runs or replaces the existing cron template +with a new one. Only future runs will be affected; any existing active tasks are left intact. + + $ aurora cron schedule devcluster/www-data/test/cron_hello_world /vagrant/examples/jobs/cron_hello_world.aurora + +### cron deschedule +Deschedules a cron job, preventing future runs but allowing current runs to complete. 
+ + $ aurora cron deschedule devcluster/www-data/test/cron_hello_world + +### cron start +Starts a cron job immediately, outside of its normal cron schedule. + + $ aurora cron start devcluster/www-data/test/cron_hello_world + +### job killall, job restart, job kill +Cron jobs create instances running on the cluster that you can interact with like normal Aurora +tasks with `job killall`, `job kill` and `job restart`. + +## Technical Note About Syntax + +`cron_schedule` uses a restricted subset of BSD crontab syntax. While the +execution engine currently uses Quartz, the schedule parsing is custom, a subset of FreeBSD +[crontab(5)](http://www.freebsd.org/cgi/man.cgi?crontab(5)) syntax. See +[the source](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/cron/CrontabEntry.java#L106-L124) +for details. + +## Caveats + +### Failovers +No failover recovery. Aurora does not record the latest minute it fired +triggers for across failovers. Therefore it's possible to miss triggers +on failover. Note that this behavior may change in the future. + +It's necessary to sync time between schedulers with something like `ntpd`. +Clock skew could cause double or missed triggers in the case of a failover. + +### Collision policy is best-effort +Aurora aims to always have *at least one copy* of a given instance running at a time - it's +an AP system, meaning it chooses Availability and Partition Tolerance at the expense of +Consistency. + +If your collision policy was `CANCEL_NEW` and a task has terminated but +Aurora has not noticed this, Aurora will go ahead and create your new +task. + +If your collision policy was `KILL_EXISTING` and a task was marked `LOST` +but not yet GCed, Aurora will go ahead and create your new task without +attempting to kill the old one (outside the GC interval). + +### Timezone Configuration +Cron timezone is configured independently of JVM timezone with the `-cron_timezone` flag and +defaults to UTC. 
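To make the five-field schedule syntax from the Technical Note concrete, here is a minimal Python sketch that checks whether a given minute matches a schedule. It is not Aurora's parser (that lives in the Java `CrontabEntry` class linked above) and may accept or reject forms that Aurora treats differently. The function names are hypothetical, and as a simplification all five fields must match, whereas classic cron ORs day-of-month and day-of-week when both are restricted.

```python
from datetime import datetime

# Inclusive value range of each of the five fields, in crontab order:
# minute, hour, day of month, month, day of week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]
DAY_NAMES = {'SUN': 0, 'MON': 1, 'TUE': 2, 'WED': 3, 'THU': 4, 'FRI': 5, 'SAT': 6}


def to_int(text):
    """Convert a number or a three-letter day name to an integer."""
    if text.upper() in DAY_NAMES:
        return DAY_NAMES[text.upper()]
    return int(text)


def expand_field(field, low, high):
    """Expand one crontab field into the set of integers it matches."""
    values = set()
    for part in field.split(','):
        step = 1
        if '/' in part:
            part, step_text = part.split('/')
            step = int(step_text)
        if part == '*':
            start, end = low, high
        elif '-' in part:
            start_text, end_text = part.split('-')
            start, end = to_int(start_text), to_int(end_text)
        else:
            start = end = to_int(part)
        if not (low <= start <= end <= high):
            raise ValueError('field %r outside %d-%d' % (field, low, high))
        values.update(range(start, end + 1, step))
    return values


def matches(schedule, when):
    """Return True if the datetime `when` falls on a minute the schedule selects."""
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError('expected 5 fields, got %d' % len(fields))
    # Python's weekday() has Monday == 0; crontab has Sunday == 0.
    actual = [when.minute, when.hour, when.day, when.month, (when.weekday() + 1) % 7]
    return all(
        value in expand_field(field, low, high)
        for field, value, (low, high) in zip(fields, actual, FIELD_RANGES))
```

With this sketch, `0 17 * * FRI` matches only 17:00 on Fridays, whereas `* 17 * * FRI` matches every minute of that hour, which is why the minute field matters when writing "at 17:00" schedules.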
Added: aurora/site/source/documentation/0.12.0/deploying-aurora-scheduler.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/deploying-aurora-scheduler.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/deploying-aurora-scheduler.md (added) +++ aurora/site/source/documentation/0.12.0/deploying-aurora-scheduler.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,372 @@ +# Deploying the Aurora Scheduler + +When setting up your cluster, you will install the scheduler on a small number (usually 3 or 5) of +machines. This guide helps you get the scheduler set up and troubleshoot some common hurdles. + +- [Installing Aurora](#installing-aurora) + - [Creating the Distribution .zip File (Optional)](#creating-the-distribution-zip-file-optional) + - [Installing Aurora](#installing-aurora-1) +- [Configuring Aurora](#configuring-aurora) + - [A Note on Configuration](#a-note-on-configuration) + - [Replicated Log Configuration](#replicated-log-configuration) + - [Initializing the Replicated Log](#initializing-the-replicated-log) + - [Storage Performance Considerations](#storage-performance-considerations) + - [Network considerations](#network-considerations) + - [Considerations for running jobs in docker](#considerations-for-running-jobs-in-docker) + - [Security Considerations](#security-considerations) + - [Configuring Resource Oversubscription](#configuring-resource-oversubscription) + - [Process Logs](#process-logs) +- [Running Aurora](#running-aurora) + - [Maintaining an Aurora Installation](#maintaining-an-aurora-installation) + - [Monitoring](#monitoring) + - [Running stateful services](#running-stateful-services) + - [Dedicated attribute](#dedicated-attribute) + - [Syntax](#syntax) + - [Example](#example) +- [Best practices](#best-practices) + - [Diversity](#diversity) +- [Common problems](#common-problems) + - [Replicated log not 
initialized](#replicated-log-not-initialized) + - [Symptoms](#symptoms) + - [Solution](#solution) + - [Scheduler not registered](#scheduler-not-registered) + - [Symptoms](#symptoms-1) + - [Solution](#solution-1) +- [Changing Scheduler Quorum Size](#changing-scheduler-quorum-size) + - [Preparation](#preparation) + - [Adding New Schedulers](#adding-new-schedulers) + +## Installing Aurora +The Aurora scheduler is a standalone Java server. As part of the build process it creates a bundle +of all its dependencies, with the notable exceptions of the JVM and libmesos. Each target server +should have a JVM (Java 8 or higher) and libmesos (0.25.0) installed. + +### Creating the Distribution .zip File (Optional) +To create a distribution for installation you will need build tools installed. On Ubuntu this can be +done with `sudo apt-get install build-essential default-jdk`. + + git clone http://git-wip-us.apache.org/repos/asf/aurora.git + cd aurora + ./gradlew distZip + +Copy the generated `dist/distributions/aurora-scheduler-*.zip` to each node that will run a scheduler. + +### Installing Aurora +Extract the aurora-scheduler zip file. The example configurations assume it is extracted to +`/usr/local/aurora-scheduler`. + + sudo unzip dist/distributions/aurora-scheduler-*.zip -d /usr/local + sudo ln -nfs "$(ls -dt /usr/local/aurora-scheduler-* | head -1)" /usr/local/aurora-scheduler + +## Configuring Aurora + +### A Note on Configuration +Like Mesos, Aurora uses command-line flags for runtime configuration. As such the Aurora +"configuration file" is typically a `scheduler.sh` shell script of the form. + + #!/bin/bash + AURORA_HOME=/usr/local/aurora-scheduler + + # Flags controlling the JVM. + JAVA_OPTS=( + -Xmx2g + -Xms2g + # GC tuning, etc. + ) + + # Flags controlling the scheduler. + AURORA_FLAGS=( + -http_port=8081 + # Log configuration, etc. + ) + + # Environment variables controlling libmesos + export JAVA_HOME=... 
+ export GLOG_v=1 + export LIBPROCESS_PORT=8083 + + JAVA_OPTS="${JAVA_OPTS[*]}" exec "$AURORA_HOME/bin/aurora-scheduler" "${AURORA_FLAGS[@]}" + +That way Aurora's current flags are visible in `ps` and in the `/vars` admin endpoint. + +Examples are available under `examples/scheduler/`. For a list of available Aurora flags and their +documentation run + + /usr/local/aurora-scheduler/bin/aurora-scheduler -help + +### Replicated Log Configuration +All Aurora state is persisted to a replicated log. This includes all jobs Aurora is running +including where in the cluster they are being run and the configuration for running them, as +well as other information such as metadata needed to reconnect to the Mesos master, resource +quotas, and any other locks in place. + +Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is +leader at a given time - the other schedulers follow log writes and prepare to take over as leader +but do not communicate with the Mesos master. Either 3 or 5 schedulers are recommended in a +production deployment depending on failure tolerance and they must have persistent storage. + +In a cluster with `N` schedulers, the flag `-native_log_quorum_size` should be set to +`floor(N/2) + 1`. So in a cluster with 1 scheduler it should be set to `1`, in a cluster with 3 it +should be set to `2`, and in a cluster of 5 it should be set to `3`. + + Number of schedulers (N) | ```-native_log_quorum_size``` setting (```floor(N/2) + 1```) + ------------------------ | ------------------------------------------------------------- + 1 | 1 + 3 | 2 + 5 | 3 + 7 | 4 + +*Incorrectly setting this flag will cause data corruption to occur!* + +See [this document](/documentation/0.12.0/storage-config/#scheduler-storage-configuration-flags) for more replicated +log and storage configuration options. + +## Initializing the Replicated Log +Before you start Aurora you will also need to initialize the log on a majority of the schedulers. 
+ + mesos-log initialize --path="/path/to/native/log" + +The `--path` flag should match the `-native_log_file_path` flag to the scheduler. +Failing to initialize the log results in the following message when you try to start the scheduler: + + Replica in EMPTY status received a broadcasted recover request + +### Storage Performance Considerations + +See [this document](/documentation/0.12.0/scheduler-storage/) for scheduler storage performance considerations. + +### Network considerations +The Aurora scheduler listens on 2 ports - an HTTP port used for client RPCs and a web UI, +and a libprocess (HTTP+Protobuf) port used to communicate with the Mesos master and for the log +replication protocol. These can be left unconfigured (the scheduler publishes all selected ports +to ZooKeeper) or explicitly set in the startup script as follows: + + # ... + AURORA_FLAGS=( + # ... + -http_port=8081 + # ... + ) + # ... + export LIBPROCESS_PORT=8083 + # ... + +### Considerations for running jobs in docker containers +In order for Aurora to launch jobs using docker containers, a few extra configuration options +must be set. The [docker containerizer](http://mesos.apache.org/documentation/latest/docker-containerizer/) +must be enabled on the mesos slaves by launching them with the `--containerizers=docker,mesos` option. + +By default, Aurora will configure Mesos to copy the file specified in `-thermos_executor_path` +into the container's sandbox. If using a wrapper script to launch the thermos executor, +specify the path to the wrapper in that argument. In addition, the path to the executor pex itself +must be included in the `-thermos_executor_resources` option. Doing so will ensure that both the +wrapper script and executor are correctly copied into the sandbox. Finally, ensure the wrapper +script does not access resources outside of the sandbox, as when the script is run from within a +docker container those resources will not exist. 
+ +In order to correctly execute processes inside a job, the docker container must have python 2.7 +installed. + +A scheduler flag, `-global_container_mounts` allows mounting paths from the host (i.e., the slave) +into all containers on that host. The format is a comma separated list of host_path:container_path[:mode] +tuples. For example `-global_container_mounts=/opt/secret_keys_dir:/mnt/secret_keys_dir:ro` mounts +`/opt/secret_keys_dir` from the slaves into all launched containers. Valid modes are `ro` and `rw`. + +If you would like to supply your own parameters to `docker run` when launching jobs in docker +containers, you may use the following flags: + + -allow_docker_parameters + -default_docker_parameters + +`-allow_docker_parameters` controls whether or not users may pass their own configuration parameters +through the job configuration files. If set to `false` (the default), the scheduler will reject +jobs with custom parameters. *NOTE*: this setting should be used with caution as it allows any job +owner to specify any parameters they wish, including those that may introduce security concerns +(`privileged=true`, for example). + +`-default_docker_parameters` allows a cluster operator to specify a universal set of parameters that +should be used for every container that does not have parameters explicitly configured at the job +level. The argument accepts a multimap format: + + -default_docker_parameters="read-only=true,tmpfs=/tmp,tmpfs=/run" + +### Process Logs + +#### Log destination +By default, Thermos will write process stdout/stderr to log files in the sandbox. Process object configuration +allows specifying alternate log file destinations like streamed stdout/stderr or suppression of all log output. 
+Default behavior can be configured for the entire cluster with the following flag (through the `-thermos_executor_flags` +argument to the Aurora scheduler): + + --runner-logger-destination=both + +`both` configuration will send logs to files and stream to parent stdout/stderr outputs. + +See [this document](/documentation/0.12.0/configuration-reference/#logger) for all destination options. + +#### Log rotation +By default, Thermos will not rotate the stdout/stderr logs from child processes and they will grow +without bound. An individual user may change this behavior via configuration on the Process object, +but it may also be desirable to change the default configuration for the entire cluster. +In order to enable rotation by default, the following flags can be applied to Thermos (through the +-thermos_executor_flags argument to the Aurora scheduler): + + --runner-logger-mode=rotate + --runner-rotate-log-size-mb=100 + --runner-rotate-log-backups=10 + +In the above example, each instance of the Thermos runner will rotate stderr/stdout logs once they +reach 100 MiB in size and keep a maximum of 10 backups. If a user has provided a custom setting for +their process, it will override these default settings. + +## Running Aurora +Configure a supervisor like [Monit](http://mmonit.com/monit/) or +[supervisord](http://supervisord.org/) to run the created `scheduler.sh` file and restart it +whenever it fails. Aurora expects to be restarted by an external process when it fails. Aurora +supports an active health checking protocol on its admin HTTP interface - if a `GET /health` times +out or returns anything other than `200 OK` the scheduler process is unhealthy and should be +restarted. + +For example, monit can be configured with + + if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart + +assuming you set `-http_port=8081`. + +## Security Considerations + +See [security.md](/documentation/0.12.0/security/). 
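Returning to the health check described under Running Aurora: a supervisor only needs to know whether `GET /health` answers `200 OK` within a deadline. The following Python sketch shows one way such a probe could look. The function name and the default timeout are assumptions, and unlike the monit rule above it inspects only the status code, not the `OK` body.

```python
from urllib.error import URLError
from urllib.request import urlopen


def scheduler_is_healthy(base_url, timeout_secs=2.0):
    """Return True only if GET <base_url>/health answers 200 within the timeout."""
    url = base_url.rstrip('/') + '/health'
    try:
        with urlopen(url, timeout=timeout_secs) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as unhealthy.
        return False
```

A supervisor loop could call `scheduler_is_healthy('http://localhost:8081')` (assuming `-http_port=8081`) and restart the scheduler after several consecutive `False` results, mirroring the "for 10 cycles" clause in the monit example.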
+ +## Configuring Resource Oversubscription + +**WARNING**: This feature is currently in alpha status. Do not use it in production clusters! +See [this document](/documentation/0.12.0/configuration-reference/#revocable-jobs) for more feature details. + +Set this scheduler flag to allow receiving revocable Mesos offers: + + -receive_revocable_resources=true + +Specify a tier configuration file path: + + -tier_config=path/to/tiers/config.json + +Example [tier configuration file](https://github.com/apache/aurora/blob/#{git_tag}/src/test/resources/org/apache/aurora/scheduler/tiers-example.json). + +### Maintaining an Aurora Installation + +### Monitoring +Please see our dedicated [monitoring guide](/documentation/0.12.0/monitoring/) for in-depth discussion on monitoring. + +### Running stateful services +Aurora is best suited to run stateless applications, but it also accommodates stateful services +like databases, or services that otherwise need to always run on the same machines. + +#### Dedicated attribute +The Mesos slave has the `--attributes` command line argument which can be used to mark a slave with +static attributes (not to be confused with `--resources`, which are dynamic and accounted). + +Aurora makes these attributes available for matching with scheduling +[constraints](/documentation/0.12.0/configuration-reference/#specifying-scheduling-constraints). Most of these +attributes are arbitrary and available for custom use. There is one exception, though: the +`dedicated` attribute. Aurora treats this attribute specially: only matching jobs are allowed to run on these +machines, and matching jobs are scheduled only on these machines. + +See the [section](/documentation/0.12.0/resources/#resource-quota) about resource quotas to learn how quotas apply to +dedicated jobs. + +##### Syntax +The dedicated attribute has semantic meaning. The format is `$role(/.*)?`. 
When a job is created, +the scheduler requires that the `$role` component matches the `role` field in the job +configuration, and will reject the job creation otherwise. The remainder of the attribute is +free-form. We've developed the idiom of formatting this attribute as `$role/$job`, but do not +enforce this. + +##### Example +Consider the following slave command line: + + mesos-slave --attributes="dedicated:db_team/redis" ... + +And this job configuration: + + Service( + name = 'redis', + role = 'db_team', + constraints = { + 'dedicated': 'db_team/redis' + } + ... + ) + +The job configuration is indicating that it should only be scheduled on slaves with the attribute +`dedicated:db_team/redis`. Additionally, Aurora will prevent any tasks that do _not_ have that +constraint from running on those slaves. + +## Best practices +### Diversity +Data centers are often organized with hierarchical failure domains. Common failure domains +include hosts, racks, rows, and PDUs. If you have this information available, it is wise to tag +the mesos-slave with them as +[attributes](https://mesos.apache.org/documentation/attributes-resources/). + +When it comes time to schedule jobs, Aurora will automatically spread them across the failure +domains as specified in the +[job configuration](/documentation/0.12.0/configuration-reference/#specifying-scheduling-constraints). + +Note: in virtualized environments like EC2, the only attribute that usually makes sense for this +purpose is `host`. + +## Common problems +So you've started your first cluster and are running into some issues? We've collected some common +stumbling blocks and solutions here to help get you moving. 
+ +### Replicated log not initialized + +#### Symptoms +- Scheduler RPCs and web interface claim `Storage is not READY` +- Scheduler log repeatedly prints messages like + + ``` + I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status + received a broadcasted recover request + I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response + from a replica in EMPTY status + ``` + +#### Solution +When you create a new cluster, you need to inform a quorum of schedulers that they are safe to +consider their database to be empty by [initializing](#initializing-the-replicated-log) the +replicated log. This is done to prevent the scheduler from modifying the cluster state in the event +of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path. + +### Scheduler not registered + +#### Symptoms +Scheduler log contains + + Framework has not been registered within the tolerated delay. + +#### Solution +Double-check that the scheduler is configured correctly to reach the master. If you are registering +the master in ZooKeeper, make sure command line argument to the master: + + --zk=zk://$ZK_HOST:2181/mesos/master + +is the same as the one on the scheduler: + + -mesos_master_address=zk://$ZK_HOST:2181/mesos/master + +## Changing Scheduler Quorum Size +Special care needs to be taken when changing the size of the Aurora scheduler quorum. +Since Aurora uses a Mesos replicated log, similar steps need to be followed as when +[changing the mesos quorum size](http://mesos.apache.org/documentation/latest/operational-guide). + +### Preparation +Increase [-native_log_quorum_size](/documentation/0.12.0/storage-config/#-native_log_quorum_size) on each +existing scheduler and restart them. When updating from 3 to 5 schedulers, the quorum size +would grow from 2 to 3. + +### Adding New Schedulers +Start the new schedulers with `-native_log_quorum_size` set to the new value. 
Failing to +first increase the quorum size on running schedulers can in some cases result in corruption +or truncation of the replicated log used by Aurora. In that case, see the documentation on +[recovering from backup](/documentation/0.12.0/storage-config/#recovering-from-a-scheduler-backup). Added: aurora/site/source/documentation/0.12.0/design-documents.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/design-documents.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/design-documents.md (added) +++ aurora/site/source/documentation/0.12.0/design-documents.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,17 @@ +# Design Documents + +Since its inception as an Apache project, larger feature additions to the +Aurora code base have been discussed in the form of design documents. Design documents +are living documents until a consensus has been reached to implement a feature +in the proposed form. + +Current and past documents: + +* [Command Hooks for the Aurora Client](design/command-hooks.md) +* [Health Checks for Updates](https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit) +* [JobUpdateDiff thrift API](https://docs.google.com/document/d/1Fc_YhhV7fc4D9Xv6gJzpfooxbK4YWZcvzw6Bd3qVTL8/edit) +* [REST API RFC](https://docs.google.com/document/d/11_lAsYIRlD5ETRzF2eSd3oa8LXAHYFD8rSetspYXaf4/edit) +* [Revocable Mesos offers in Aurora](https://docs.google.com/document/d/1r1WCHgmPJp5wbrqSZLsgtxPNj3sULfHrSFmxp2GyPTo/edit) +* [Ubiquitous Jobs](https://docs.google.com/document/d/12hr6GnUZU3mc7xsWRzMi3nQILGB-3vyUxvbG-6YmvdE/edit) + +Design documents can be found in the Aurora issue tracker via the query [`project = AURORA AND text ~ "docs.google.com" ORDER BY created`](https://issues.apache.org/jira/browse/AURORA-1528?jql=project%20%3D%20AURORA%20AND%20text%20~%20%22docs.google.com%22%20ORDER%20BY%20created). 
Added: aurora/site/source/documentation/0.12.0/design/command-hooks.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/design/command-hooks.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/design/command-hooks.md (added) +++ aurora/site/source/documentation/0.12.0/design/command-hooks.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,102 @@ +# Command Hooks for the Aurora Client + +## Introduction/Motivation + +We've got hooks in the client that surround API calls. These are +pretty awkward, because they don't correlate with user actions. For +example, suppose we wanted a policy that said users weren't allowed to +kill all instances of a production job at once. + +Right now, all that we could hook would be the "killJob" api call. But +kill (at least in newer versions of the client) normally runs in +batches. If a user called killall, what we would see on the API level +is a series of "killJob" calls, each of which specified a batch of +instances. We wouldn't be able to distinguish between really killing +all instances of a job (which is forbidden under this policy), and +carefully killing in batches (which is permitted). In each case, the +hook would just see a series of API calls, and couldn't find out what +the actual command being executed was! + +For most policy enforcement, what we really want to be able to do is +look at and vet the commands that a user is performing, not the API +calls that the client uses to implement those commands. + +So I propose that we add a new kind of hook, which surrounds noun/verb +commands. A hook will register itself to handle a collection of (noun, +verb) pairs. Whenever any of those noun/verb commands are invoked, the +hook's methods will be called around the execution of the verb. A +pre-hook will have the ability to reject a command, preventing the +verb from being executed. 
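The proposal in the last paragraph can be sketched in a few lines of Python. The hook below uses the method names of the `CommandHook` API this document proposes, but everything else is an assumption for illustration: the `NoKillAllInProd` policy, the `prod` environment name, the `run_command` dispatcher, and passing `None` as the context are hypothetical and are not taken from the client's implementation.

```python
class NoKillAllInProd(object):
    """Hypothetical policy hook: forbid `job killall` against prod job keys."""

    name = 'no_killall_in_prod'

    def get_nouns(self):
        return ['job']

    def get_verbs(self, noun):
        return ['killall'] if noun == 'job' else []

    def pre_command(self, noun, verb, context, commandline):
        # Job keys look like cluster/role/env/name; reject when env is 'prod'.
        job_keys = [arg for arg in commandline if arg.count('/') == 3]
        return not any(key.split('/')[2] == 'prod' for key in job_keys)

    def post_command(self, noun, verb, context, commandline, result):
        pass


def run_command(hooks, noun, verb, commandline, execute):
    """Run `execute` surrounded by every hook registered for (noun, verb)."""
    active = [h for h in hooks if noun in h.get_nouns() and verb in h.get_verbs(noun)]
    for hook in active:
        # A pre-hook returning False rejects the command before the verb runs.
        if not hook.pre_command(noun, verb, None, commandline):
            return 'rejected by %s' % hook.name
    result = execute()
    for hook in active:
        hook.post_command(noun, verb, None, commandline, result)
    return result
```

Because the hook sees the noun and verb rather than individual API calls, it can tell `job killall` apart from a batched `job kill`, which is exactly the distinction the API-level hooks could not make.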
+ +## Registering Hooks + +These hooks will be registered via configuration plugins. A configuration plugin +can register hooks using an API. Hooks registered this way are, effectively, +hardwired into the client executable. + +The order of execution of hooks is unspecified: they may be called in +any order. There is no way to guarantee that one hook will execute +before some other hook. + + +### Global Hooks + +Hooks registered by the Python call are called _global_ hooks, +because they will run for all configurations, whether or not they +specify any hooks in the configuration file. + +In the implementation, hooks are registered in the module +`apache.aurora.client.cli.command_hooks`, using the class +`GlobalCommandHookRegistry`. A global hook can be registered by calling +`GlobalCommandHookRegistry.register_command_hook` in a configuration plugin. + +### The API + + from abc import abstractmethod + + class CommandHook(object): + @property + def name(self): + """Returns a name for the hook.""" + + def get_nouns(self): + """Return the nouns that have verbs that should invoke this hook.""" + + def get_verbs(self, noun): + """Return the verbs for a particular noun that should invoke this hook.""" + + @abstractmethod + def pre_command(self, noun, verb, context, commandline): + """Execute a hook before invoking a verb. + * noun: the noun being invoked. + * verb: the verb being invoked. + * context: the context object that will be used to invoke the verb. 
+ The options object will be initialized before calling the hook + * commandline: the original argv collection used to invoke the client. + * result: the result code returned by the verb. + Returns: nothing + """ + + class GlobalCommandHookRegistry(object): + @classmethod + def register_command_hook(cls, hook): + pass + +### Skipping Hooks + +To skip a hook, a user passes the command-line option `--skip-hooks`. The option can either +name specific hooks to skip, or "all": + +* `aurora --skip-hooks=all job create east/bozo/devel/myjob` will create a job + without running any hooks. +* `aurora --skip-hooks=test,iq job create east/bozo/devel/myjob` will create a job, + and will skip only the hooks named "test" and "iq". Added: aurora/site/source/documentation/0.12.0/developing-aurora-client.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/developing-aurora-client.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/developing-aurora-client.md (added) +++ aurora/site/source/documentation/0.12.0/developing-aurora-client.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,93 @@ +Getting Started +=============== + +The client is written in Python, and uses the +[Pants](http://pantsbuild.github.io/python-readme.html) build tool. + +Client Configuration +==================== + +The client uses a configuration file that specifies available clusters. More information about the +contents of this file can be found in the +[Client Cluster Configuration](/documentation/0.12.0/client-cluster-configuration/) documentation. Information about +how the client locates this file can be found in the +[Client Commands](/documentation/0.12.0/client-commands/#cluster-configuration) documentation. + +Building and Testing the Client +=============================== + +Building and testing the client code are both done using Pants. 
The relevant targets to know about +are: + + * Build a client executable: `./pants binary src/main/python/apache/aurora/client:aurora` + * Test client code: `./pants test src/test/python/apache/aurora/client/cli:all` + +If you want to build a source distribution of the client, you need to run `./build-support/release/make-python-sdists`. + +Running/Debugging the Client +============================ + +For manually testing client changes against a cluster, we use [Vagrant](https://www.vagrantup.com/). +To start a virtual cluster, you need to install Vagrant, and then run `vagrant up` from the root of +the Aurora workspace. This will create a Vagrant host named "devcluster", with a Mesos master, a set +of Mesos slaves, and an Aurora scheduler. + +To test a change in your local cluster, rebuild the client: + + vagrant ssh -c 'aurorabuild client' + +Once this completes, the `aurora` command will reflect your changes. + +Running/Debugging the Client in PyCharm +======================================= + +It's possible to use PyCharm to run and debug both the client and client tests in an IDE. In order +to do this, first run: + + build-support/python/make-pycharm-virtualenv + +This script will configure a virtualenv with all of our Python requirements. Once the script +completes it will emit instructions for configuring PyCharm: + + Your PyCharm environment is now set up. You can open the project root + directory with PyCharm.
+ + Once the project is loaded: + - open project settings + - click 'Project Interpreter' + - click the cog in the upper-right corner + - click 'Add Local' + - select 'build-support/python/pycharm.venv/bin/python' + - click 'OK' + +### Running/Debugging Tests + +After following these instructions, you should now be able to run/debug tests directly from the IDE +by right-clicking on a test (or test class) and choosing to run or debug: + +[](images/debug-client-test.png) + +If you've set a breakpoint, the run will stop there and let you debug: + +[](images/debugging-client-test.png) + +### Running/Debugging the Client + +Actually running and debugging the client is unfortunately a bit more complex. You'll need to create +a Run configuration: + +* Go to Run → Edit Configurations +* Click the + icon to add a new configuration. +* Choose Python and name the configuration 'client'. +* Set the script path to `/your/path/to/aurora/src/main/python/apache/aurora/client/cli/client.py` +* Set the script parameters to the command you want to run (e.g. `job status <job key>`) +* Expand the Environment section and click the ellipsis to add a new environment variable +* Click the + at the bottom to add a new variable named AURORA_CONFIG_ROOT whose value is the + path where your cluster configuration can be found. For example, to talk to the scheduler + running in the vagrant image, it would be set to `/your/path/to/aurora/examples/vagrant` (this + is the directory where our example clusters.json is found). +* You should now be able to run and debug this configuration! + +Making thrift schema changes +============================ +See [this document](/documentation/0.12.0/thrift-deprecation/) for any thrift related changes.
Added: aurora/site/source/documentation/0.12.0/developing-aurora-scheduler.md URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.12.0/developing-aurora-scheduler.md?rev=1733548&view=auto ============================================================================== --- aurora/site/source/documentation/0.12.0/developing-aurora-scheduler.md (added) +++ aurora/site/source/documentation/0.12.0/developing-aurora-scheduler.md Fri Mar 4 02:43:01 2016 @@ -0,0 +1,163 @@ +Java code in the aurora repo is built with [Gradle](http://gradle.org). + + +Prerequisite +============ + +When using Apache Aurora checked out from the source repository or the binary +distribution, the Gradle wrapper and JavaScript dependencies are provided. +However, you need to manually install them when using the source release +downloads: + +1. Install Gradle following the instructions on the [Gradle web site](http://gradle.org) +2. From the root directory of the Apache Aurora project generate the gradle +wrapper by running: + + gradle wrapper + + +Getting Started +=============== + +You will need Java 8 installed and on your `PATH` or unzipped somewhere with `JAVA_HOME` set. Then + + ./gradlew tasks + +will bootstrap the build system and show available tasks. This can take a while the first time you +run it but subsequent runs will be much faster due to cached artifacts. + +Running the Tests +----------------- +Aurora has a comprehensive unit test suite. To run the tests use + + ./gradlew build + +Gradle will only re-run tests when their dependencies have changed. To force a re-run of all +tests use + + ./gradlew clean build + +Running the build with code quality checks +------------------------------------------ +To speed up development iteration, the plain gradle commands will not run static analysis tools. +However, you should run them before posting a review diff, and **always** run them before pushing a +commit to origin/master.
+ + ./gradlew build -Pq + +Running integration tests +------------------------- +To run the same tests that are run in the Apache Aurora continuous integration +environment: + + ./build-support/jenkins/build.sh + + +In addition, there is an end-to-end test that runs a suite of aurora commands +using a virtual cluster: + + ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh + + + +Creating a bundle for deployment +-------------------------------- +Gradle can create a zip file containing Aurora, all of its dependencies, and a launch script with + + ./gradlew distZip + +or a tar file containing the same files with + + ./gradlew distTar + +The output file will be written to `dist/distributions/aurora-scheduler.zip` or +`dist/distributions/aurora-scheduler.tar`. + +Developing Aurora Java code +=========================== + +Setting up an IDE +----------------- +Gradle can generate project files for your IDE. To generate an IntelliJ IDEA project run + + ./gradlew idea + +and import the generated `aurora.ipr` file. + +Adding or Upgrading a Dependency +-------------------------------- +New dependencies can be added from Maven central by adding a `compile` dependency to `build.gradle`. +For example, to add a dependency on `com.example`'s `example-lib` 1.0 add this block: + + compile 'com.example:example-lib:1.0' + +NOTE: Anyone thinking about adding a new dependency should first familiarize themself with the +Apache Foundation's third-party licensing +[policy](http://www.apache.org/legal/resolved.html#category-x). + +Developing Aurora UI +====================== + +Installing bower (optional) +---------------------------- +Third party JS libraries used in Aurora (located at 3rdparty/javascript/bower_components) are +managed by bower, a JS dependency manager. Bower is only required if you plan to add, remove or +update JS libraries. Bower can be installed using the following command: + + npm install -g bower + +Bower depends on node.js and npm. 
The easiest way to install node on a Mac is via brew: + + brew install node + +For more node.js installation options refer to https://github.com/joyent/node/wiki/Installation. + +More info on installing and using bower can be found at: http://bower.io/. Once installed, you can +use the following commands to view and modify the bower repo at +3rdparty/javascript/bower_components: + + bower list + bower install <library name> + bower remove <library name> + bower update <library name> + bower help + +Faster Iteration in Vagrant +--------------------------- +The scheduler serves UI assets from the classpath. For production deployments this means the assets +are served from within a jar. However, for faster development iteration, the vagrant image is +configured to add `/vagrant/dist/resources/main` to the head of CLASSPATH. This path is configured +as a shared filesystem to the path on the host system where your Aurora repository lives. This means +that any updates to dist/resources/main in your checkout will be reflected immediately in the UI +served from within the vagrant image. + +The one caveat to this is that this path is under `dist` not `src`. This is because the assets must +be processed by gradle before they can be served. So, unfortunately, you cannot just save your local +changes and see them reflected in the UI; you must first run `./gradlew processResources`. This is +less than ideal, but better than having to restart the scheduler after every change. Additionally, +gradle makes this process somewhat easier with the use of the `--continuous` flag. If you run +`./gradlew processResources --continuous`, gradle will monitor the filesystem for changes and run the +task automatically as necessary. This doesn't quite provide hot-reload capabilities, but it does +allow for <5s from save to changes being visible in the UI with no further action required on the +part of the developer.
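The continuous workflow described above can be wrapped in a small shell function. This is only a sketch: `watch_ui_assets` and the `GRADLEW` override are hypothetical names introduced here for illustration and are not part of the Aurora repository. The `processResources` task and the `--continuous` flag are the ones named in the text.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the Aurora repository.
# GRADLEW is an assumed override for the wrapper location; by default the
# wrapper in the current directory (the repository root) is used.
watch_ui_assets() {
  # Re-run processResources whenever gradle detects a changed input, so
  # dist/resources/main stays current for the UI served from the vagrant image.
  "${GRADLEW:-./gradlew}" processResources --continuous
}
```

Call `watch_ui_assets` from the root of your checkout and leave it running while you edit UI files; stop it with Ctrl-C when you are done.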
+ +Developing the Aurora Build System +================================== + +Bootstrapping Gradle +-------------------- +The following files were autogenerated by `gradle wrapper` using gradle 1.8's +[Wrapper](http://www.gradle.org/docs/1.8/dsl/org.gradle.api.tasks.wrapper.Wrapper.html) plugin and +should not be modified directly: + + ./gradlew + ./gradlew.bat + ./gradle/wrapper/gradle-wrapper.jar + ./gradle/wrapper/gradle-wrapper.properties + +To upgrade Gradle unpack the new version somewhere, run `/path/to/new/gradle wrapper` in the +repository root and commit the changed files. + +Making thrift schema changes +============================ +See [this document](/documentation/0.12.0/thrift-deprecation/) for any thrift related changes.
