Added: aurora/site/publish/documentation/0.11.0/monitoring/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/monitoring/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/monitoring/index.html (added) +++ aurora/site/publish/documentation/0.11.0/monitoring/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,301 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/monitoring/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<h1 id="monitoring-your-aurora-cluster">Monitoring your Aurora cluster</h1> + +<p>Before you start running important services in your Aurora cluster, it’s important to set up +monitoring and alerting of Aurora itself. Most of your monitoring can be against the scheduler, +since it will give you a global view of what’s going on.</p> + +<h2 id="reading-stats">Reading stats</h2> + +<p>The scheduler exposes a <em>lot</em> of instrumentation data via its HTTP interface. 
You can get a quick +peek at the first few of these in our vagrant image:</p> +<pre class="highlight plaintext"><code>$ vagrant ssh -c 'curl -s localhost:8081/vars | head' +async_tasks_completed 1004 +attribute_store_fetch_all_events 15 +attribute_store_fetch_all_events_per_sec 0.0 +attribute_store_fetch_all_nanos_per_event 0.0 +attribute_store_fetch_all_nanos_total 3048285 +attribute_store_fetch_all_nanos_total_per_sec 0.0 +attribute_store_fetch_one_events 3391 +attribute_store_fetch_one_events_per_sec 0.0 +attribute_store_fetch_one_nanos_per_event 0.0 +attribute_store_fetch_one_nanos_total 454690753 +</code></pre> + +<p>These values are served as <code>Content-Type: text/plain</code>, with each line containing a space-separated metric +name and value. Values may be integers, doubles, or strings (note: strings are static, others +may be dynamic).</p> + +<p>If your monitoring infrastructure prefers JSON, the scheduler exports that as well:</p> +<pre class="highlight plaintext"><code>$ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head' +{ + "async_tasks_completed": 1009, + "attribute_store_fetch_all_events": 15, + "attribute_store_fetch_all_events_per_sec": 0.0, + "attribute_store_fetch_all_nanos_per_event": 0.0, + "attribute_store_fetch_all_nanos_total": 3048285, + "attribute_store_fetch_all_nanos_total_per_sec": 0.0, + "attribute_store_fetch_one_events": 3409, + "attribute_store_fetch_one_events_per_sec": 0.0, + "attribute_store_fetch_one_nanos_per_event": 0.0, +</code></pre> + +<p>This will be the same data as above, served with <code>Content-Type: application/json</code>.</p> + +<h2 id="viewing-live-stat-samples-on-the-scheduler">Viewing live stat samples on the scheduler</h2> + +<p>The scheduler uses the Twitter commons stats library, which keeps an internal time-series database +of exported variables - nearly everything in <code>/vars</code> is available for instant graphing. 
This is +useful for debugging, but is not a replacement for an external monitoring system.</p> + +<p>You can view these graphs on a scheduler at <code>/graphview</code>. It supports some composition and +aggregation of values, which can be invaluable when triaging a problem. For example, if you have +the scheduler running in vagrant, check out these links: +<a href="http://192.168.33.7:8081/graphview?query=jvm_uptime_secs">simple graph</a> +<a href="http://192.168.33.7:8081/graphview?query=rate(scheduler_log_native_append_nanos_total)%2Frate(scheduler_log_native_append_events)%2F1e6">complex composition</a></p> + +<h3 id="counters-and-gauges">Counters and gauges</h3> + +<p>Among numeric stats, there are two fundamental types of stats exported: <em>counters</em> and <em>gauges</em>. +Counters are guaranteed to be monotonically-increasing for the lifetime of a process, while gauges +may decrease in value. Aurora uses counters to represent things like the number of times an event +has occurred, and gauges to capture things like the current length of a queue. Counters are a +natural fit for accurate composition into <a href="http://en.wikipedia.org/wiki/Rate_ratio">rate ratios</a> +(useful for sample-resistant latency calculation), while gauges are not.</p> + +<h1 id="alerting">Alerting</h1> + +<h2 id="quickstart">Quickstart</h2> + +<p>If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting +on <code>framework_registered</code> and <code>task_store_LOST</code>. These will give you a decent picture of overall +health.</p> + +<h2 id="a-note-on-thresholds">A note on thresholds</h2> + +<p>One of the most difficult things in monitoring is choosing alert thresholds. With many of these +stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It +will depend on the size of your cluster, number of jobs, churn of tasks in the cluster, etc. 
We +recommend you start with a strict value after viewing a small amount of collected data, and then +adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts +and thresholds make sense.</p> + +<h2 id="important-stats">Important stats</h2> + +<h3 id="jvm_uptime_secs"><code>jvm_uptime_secs</code></h3> + +<p>Type: integer counter</p> + +<p>The number of seconds the JVM process has been running. Comes from +<a href="http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime()">RuntimeMXBean#getUptime()</a></p> + +<p>Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to +stay alive.</p> + +<p>Look at the scheduler logs to identify the reason the scheduler is exiting.</p> + +<h3 id="system_load_avg"><code>system_load_avg</code></h3> + +<p>Type: double gauge</p> + +<p>The current load average of the system for the last minute. Comes from +<a href="http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage()">OperatingSystemMXBean#getSystemLoadAverage()</a>.</p> + +<p>A high sustained value suggests that the scheduler machine may be over-utilized.</p> + +<p>Use standard unix tools like <code>top</code> and <code>ps</code> to track down the offending process(es).</p> + +<h3 id="process_cpu_cores_utilized"><code>process_cpu_cores_utilized</code></h3> + +<p>Type: double gauge</p> + +<p>The current number of CPU cores in use by the JVM process. This should not exceed the number of +logical CPU cores on the machine. Derived from +<a href="http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html">OperatingSystemMXBean#getProcessCpuTime()</a></p> + +<p>A high sustained value indicates that the scheduler is overworked. 
Due to current internal design +limitations, if this value is sustained at <code>1</code>, there is a good chance the scheduler is under water.</p> + +<p>There are two main inputs that tend to drive this figure: task scheduling attempts and status +updates from Mesos. You may see activity in the scheduler logs to give an indication of where +time is being spent. Beyond that, it really takes good familiarity with the code to effectively +triage this. We suggest engaging with an Aurora developer.</p> + +<h3 id="task_store_lost"><code>task_store_LOST</code></h3> + +<p>Type: integer gauge</p> + +<p>The number of tasks stored in the scheduler that are in the <code>LOST</code> state, and have been rescheduled.</p> + +<p>If this value is increasing at a high rate, it is a sign of trouble.</p> + +<p>There are many sources of <code>LOST</code> tasks in Mesos: the scheduler, master, slave, and executor can all +trigger this. The first step is to look in the scheduler logs for <code>LOST</code> to identify where the +state changes are originating.</p> + +<h3 id="scheduler_resource_offers"><code>scheduler_resource_offers</code></h3> + +<p>Type: integer counter</p> + +<p>The number of resource offers that the scheduler has received.</p> + +<p>For a healthy scheduler, this value must be increasing over time.</p> + +<p>Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it +is sending offers. 
You should also look at the master’s web interface to see if it has a large +number of outstanding offers that it is waiting to have returned.</p> + +<h3 id="framework_registered"><code>framework_registered</code></h3> + +<p>Type: binary integer counter</p> + +<p>Will be <code>1</code> for the leading scheduler that is registered with the Mesos master, <code>0</code> for passive +schedulers.</p> + +<p>A sustained period without a <code>1</code> (or where <code>sum() != 1</code>) warrants investigation.</p> + +<p>If there is no leading scheduler, look in the scheduler and master logs to find out why. If there are +multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical +bug.</p> + +<h3 id="rate-scheduler_log_native_append_nanos_total-rate-scheduler_log_native_append_events"><code>rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)</code></h3> + +<p>Type: rate ratio of integer counters</p> + +<p>This composes two counters to compute a windowed figure for the latency of replicated log writes.</p> + +<p>A sharp increase in this value suggests disk bandwidth contention.</p> + +<p>Look in scheduler logs for any reported oddness with saving to the replicated log. Also use +standard tools like <code>vmstat</code> and <code>iotop</code> to identify whether the disk has become slow or +over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.</p> + +<h3 id="timed_out_tasks"><code>timed_out_tasks</code></h3> + +<p>Type: integer counter</p> + +<p>Tracks the number of times the scheduler has given up while waiting +(for <code>-transient_task_state_timeout</code>) to hear back about a task that is in a transient state +(e.g. 
<code>ASSIGNED</code>, <code>KILLING</code>), and has moved it to <code>LOST</code> before rescheduling.</p> + +<p>This value is currently known to increase occasionally when the scheduler fails over +(<a href="https://issues.apache.org/jira/browse/AURORA-740">AURORA-740</a>). However, any large spike in this +value warrants investigation.</p> + +<p>The scheduler will log when it times out a task. You should trace the task ID of the timed-out +task into the master, slave, and/or executors to determine where the message was dropped.</p> + +<h3 id="http_500_responses_events"><code>http_500_responses_events</code></h3> + +<p>Type: integer counter</p> + +<p>The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.</p> + +<p>An increase warrants investigation.</p> + +<p>Look in scheduler logs to identify why the scheduler returned a 500; there should be a stack trace.</p> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. 
The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html>
Added: aurora/site/publish/documentation/0.11.0/presentations/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/presentations/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/presentations/index.html (added) +++ aurora/site/publish/documentation/0.11.0/presentations/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,157 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/presentations/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<h1 id="apache-aurora-presentations">Apache Aurora Presentations</h1> + +<p>Video and slides from presentations and panel discussions about Apache Aurora.</p> + +<p><em>(Listed in date descending order)</em></p> + +<table> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png" alt="From Monolith to Microservices with Aurora Video Thumbnail" /></td> + <td><strong><a href="https://www.youtube.com/watch?v=yXkOgnyK4Hw">From Monolith to Microservices w/ Aurora (Video)</a></strong> + <p>Presented by Thanos Baskous, 
Tony Dong, Dobromir Montauk</p> + <p>April 30, 2015 at <a href="http://www.meetup.com/Bay-Area-Apache-Aurora-Users-Group/events/221219480/">Bay Area Apache Aurora Users Group</a></p></td> + </tr> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/02_28_2015_apache_aurora_thumb.png" alt="Apache Auroraã®å§ããã Slideshow Thumbnail" /></td> + <td><strong><a href="http://www.slideshare.net/zembutsu/apache-aurora-introduction-and-tutorial-osc15tk">Apache Auroraã®å§ããã (Slides)</a></strong> + <p>Presented by Masahito Zembutsu</p> + <p>February 28, 2015 at <a href="http://www.ospn.jp/osc2015-spring/">Open Source Conference 2015 Tokyo Spring</a></p></td> + </tr> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png" alt="Apache Aurora Adopters Panel Video Thumbnail" /></td> + <td><strong><a href="https://www.youtube.com/watch?v=2Jsj0zFdRlg">Apache Aurora Adopters Panel (Video)</a></strong> + <p>Panelists Ben Staffin, Josh Adams, Bill Farner, Berk Demir</p> + <p>February 19, 2015 at <a href="http://www.meetup.com/Bay-Area-Mesos-User-Group/events/220279080/">Bay Area Mesos Users Group</a></p></td> + </tr> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png" alt="Operating Apache Aurora and Mesos at Twitter Video Thumbnail" /></td> + <td><strong><a href="https://www.youtube.com/watch?v=E4lxX6epM_U">Operating Apache Aurora and Mesos at Twitter (Video)</a></strong> + <p>Presented by Joe Smith</p> + <p>February 19, 2015 at <a href="http://www.meetup.com/Bay-Area-Mesos-User-Group/events/220279080/">Bay Area Mesos Users Group</a></p></td> + </tr> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png" alt="Apache Aurora and Mesos at TellApart" /></td> + <td><strong><a href="https://www.youtube.com/watch?v=ZZXtXLvTXAE">Apache Aurora and Mesos at TellApart (Video)</a></strong> + <p>Presented by Steve Niemitz</p> 
+ <p>February 19, 2015 at <a href="http://www.meetup.com/Bay-Area-Mesos-User-Group/events/220279080/">Bay Area Mesos Users Group</a></p></td> + </tr> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/08_21_2014_past_present_future_thumb.png" alt="Past, Present, and Future of the Aurora Scheduler Video Thumbnail" /></td> + <td><strong><a href="https://www.youtube.com/watch?v=Dsc5CPhKs4o">Past, Present, and Future of the Aurora Scheduler (Video)</a></strong> + <p>Presented by Bill Farner</p> + <p>August 21, 2014 at <a href="http://events.linuxfoundation.org/events/archive/2014/mesoscon">#MesosCon 2014</a></p> +</td> + </tr> + <tr> + <td><img src="/documentation/0.11.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png" alt="Introduction to Apache Aurora Video Thumbnail" /></td> + <td><strong><a href="https://www.youtube.com/watch?v=asd_h6VzaJc">Introduction to Apache Aurora (Video)</a></strong> + <p>Presented by Bill Farner</p> + <p>March 25, 2014 at <a href="https://www.eventbrite.com/e/aurora-and-mesosframeworksmeetup-tickets-10850994617">Aurora and Mesos Frameworks Meetup</a></p></td> + </tr> +</table> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div 
class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html> Added: aurora/site/publish/documentation/0.11.0/resources/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/resources/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/resources/index.html (added) +++ aurora/site/publish/documentation/0.11.0/resources/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,273 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/resources/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<h1 id="resources-and-sizing">Resources and Sizing</h1> + +<ul> +<li><a href="#introduction">Introduction</a></li> +<li><a href="#cpu-isolation">CPU Isolation</a></li> +<li><a href="#cpu-sizing">CPU Sizing</a></li> +<li><a href="#memory-isolation">Memory Isolation</a></li> +<li><a href="#memory-sizing">Memory Sizing</a></li> +<li><a href="#disk-space">Disk Space</a></li> +<li><a href="#disk-space-sizing">Disk Space Sizing</a></li> +<li><a href="#other-resources">Other Resources</a></li> +<li><a href="#resource-quota">Resource Quota</a></li> +<li><a 
href="#task-preemption">Task Preemption</a></li> +</ul> + +<h2 id="introduction">Introduction</h2> + +<p>Aurora is a multi-tenant system; a single software instance runs on a +server, serving multiple clients/tenants. To share resources among +tenants, it implements isolation of:</p> + +<ul> +<li>CPU</li> +<li>memory</li> +<li>disk space</li> +</ul> + +<p>CPU is a soft limit and is handled differently from memory and disk space. +Too low a CPU value results in throttling your application and +slowing it down. Memory and disk space are both hard limits; when your +application goes over these values, it’s killed.</p> + +<p>Let’s look at each resource type in more detail:</p> + +<h2 id="cpu-isolation">CPU Isolation</h2> + +<p>Mesos uses a quota-based CPU scheduler (the <em>Completely Fair Scheduler</em>) +to provide consistent and predictable performance. This is effectively +a guarantee of resources – you receive at least what you requested, but +also no more than you’ve requested.</p> + +<p>The scheduler gives applications a CPU quota for every 100 ms interval. +When an application uses up its quota for an interval, it is throttled for +the rest of the 100 ms. Usage resets for each interval and unused +quota does not carry over.</p> + +<p>For example, an application specifying 4.0 CPU has access to 400 ms of +CPU time every 100 ms. This CPU quota can be used in different ways, +depending on the application and available resources. Consider the +scenarios shown in this diagram.</p> + +<p><img alt="CPU Availability" src="../images/CPUavailability.png" /></p> + +<ul> +<li><p><em>Scenario A</em>: the application can use up to 4 cores continuously for +every 100 ms interval. It is never throttled and starts processing +new requests immediately.</p></li> +<li><p><em>Scenario B</em>: the application uses up to 8 cores (depending on +availability) but is throttled after 50 ms. 
The CPU quota resets at the +start of each new 100 ms interval.</p></li> +<li><p><em>Scenario C</em>: like Scenario A, but there is a garbage collection +event in the second interval that consumes all CPU quota. The +application is throttled for the remaining 75 ms of that interval and +cannot service requests until the next interval. In this example, the +garbage collection finished in one interval but, depending on how much +garbage needs collecting, it may take more than one interval and further +delay service of requests.</p></li> +</ul> + +<p><em>Technical note</em>: Mesos considers logical cores, also known as +hyperthreading or SMT cores, as the unit of CPU.</p> + +<h2 id="cpu-sizing">CPU Sizing</h2> + +<p>To correctly size Aurora-run Mesos tasks, specify a per-shard CPU value +that lets the task run at its desired performance when at peak load +distributed across all shards. Include reserve capacity of at least 50%, +possibly more, depending on how critical your service is (or how +confident you are about your original estimate), ideally by +increasing the number of shards to also improve resiliency. When running +your application, observe its CPU stats over time. If it is consistently at or +near your quota during peak load, you should consider increasing either +per-shard CPU or the number of shards.</p> + +<h2 id="memory-isolation">Memory Isolation</h2> + +<p>Mesos uses dedicated memory allocation. Your application always has +access to the amount of memory specified in your configuration. The +application’s memory use is defined as the sum of the resident set size +(RSS) of all processes in a shard. Each shard is considered +independently.</p> + +<p>In other words, say you specified a memory size of 10GB. Each shard +would receive 10GB of memory. 
If an individual shard’s memory demands +exceed 10GB, that shard is killed, but the other shards continue +working.</p> + +<p><em>Technical note</em>: Total memory size is not enforced at allocation time, +so your application can request more than its allocation without getting +an ENOMEM. However, it will be killed shortly after.</p> + +<h2 id="memory-sizing">Memory Sizing</h2> + +<p>Size for your application’s peak requirement. Observe the per-instance +memory statistics over time, as memory requirements can vary over +different periods. Remember that if your application exceeds its memory +value, it will be killed, so you should also add a safety margin of +around 10-20%. If you have the ability to do so, you may also want to +put alerts on the per-instance memory.</p> + +<h2 id="disk-space">Disk Space</h2> + +<p>Disk space used by your application is defined as the sum of the files’ +disk space in your application’s directory, including the <code>stdout</code> and +<code>stderr</code> logged from your application. Each shard is considered +independently. You should use off-node storage for your application’s +data whenever possible.</p> + +<p>In other words, say you specified disk space size of 100MB. Each shard +would receive 100MB of disk space. If an individual shard’s disk space +demands exceed 100MB, that shard is killed, but the other shards +continue working.</p> + +<p>After your application finishes running, its allocated disk space is +reclaimed. Thus, your job’s final action should move any disk content +that you want to keep, such as logs, to your home file system or other +less transitory storage. 
Disk reclamation takes place at an undefined +time after the application finishes; until then, the disk contents +are still available but you shouldn’t count on them being so.</p> + +<p><em>Technical note</em>: Disk space is not enforced at write time, so your +application can write above its quota without getting an ENOSPC, but it +will be killed shortly after. This is subject to change.</p> + +<h2 id="disk-space-sizing">Disk Space Sizing</h2> + +<p>Size for your application’s peak requirement. Rotate and discard log +files as needed to stay within your quota. When running a Java process, +add the maximum size of the Java heap to your disk space requirement, in +order to account for an out-of-memory error dumping the heap +into the application’s sandbox space.</p> + +<h2 id="other-resources">Other Resources</h2> + +<p>Other resources, such as network bandwidth, do not have any performance +guarantees. For some resources, such as memory bandwidth, there are no +practical sharing methods, so some application combinations collocated on +the same host may cause contention.</p> + +<h2 id="resource-quota">Resource Quota</h2> + +<p>Aurora requires resource quotas for +<a href="/documentation/0.11.0/configuration-reference/#job-objects">production non-dedicated jobs</a>. Quota is enforced at +the job role level and, when set, defines a non-preemptible pool of compute resources within +that role.</p> + +<p>To grant quota to a particular role in production, use the <code>aurora_admin set_quota</code> command.</p> + +<p>NOTE: all job types (service, adhoc or cron) require role resource quota unless the job has a +<a href="/documentation/0.11.0/deploying-aurora-scheduler/#dedicated-attribute">dedicated constraint set</a>.</p> + +<h2 id="task-preemption">Task preemption</h2> + +<p>Under resource shortage pressure, tasks from +<a href="/documentation/0.11.0/configuration-reference/#job-objects">production</a> jobs may preempt tasks from any non-production +job. 
A production task may only be preempted by tasks from production jobs in the same role with +higher <a href="/documentation/0.11.0/configuration-reference/#job-objects">priority</a>.</p> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. 
Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html> Added: aurora/site/publish/documentation/0.11.0/scheduler-storage/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/scheduler-storage/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/scheduler-storage/index.html (added) +++ aurora/site/publish/documentation/0.11.0/scheduler-storage/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,153 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/scheduler-storage/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<h1 id="snapshot-performance">Snapshot Performance</h1> + +<p>Periodically the scheduler writes a full snapshot of its state to the replicated log. To do this +it needs to hold a global storage write lock while it writes out this data. In large clusters +this has been observed to take up to 40 seconds. Long pauses can cause issues in the system, +including delays in scheduling new tasks.</p> + +<p>The scheduler has two optimizations to reduce the size of snapshots and thus improve snapshot +performance: compression and deduplication. 
Most users will want to enable both compression +and deduplication.</p> + +<h2 id="compression">Compression</h2> + +<p>To reduce the size of the snapshot, the DEFLATE algorithm can be applied to the serialized bytes +of the snapshot as they are written to the stream. This reduces the total number of bytes that +need to be written to the replicated log at the cost of CPU and generally reduces the amount +of time a snapshot takes.</p> + +<h3 id="enabling-compression">Enabling Compression</h3> + +<p>Snapshot compression is enabled via the <code>-deflate_snapshots</code> flag. This is the default since +Aurora 0.5.0. All released versions of Aurora can read both compressed and uncompressed snapshots, +so there are no backwards compatibility concerns associated with changing this flag.</p> + +<h3 id="disabling-compression">Disabling Compression</h3> + +<p>Disable compression by passing <code>-deflate_snapshots=false</code>.</p> + +<h2 id="deduplication">Deduplication</h2> + +<p>In Aurora 0.6.0 a new snapshot format was introduced. Rather than write one configuration blob +per Mesos task, this format stores each configuration blob once, and each Mesos task with a +pointer to its blob. This format is not backwards compatible with earlier versions of Aurora.</p> + +<h3 id="enabling-deduplication">Enabling Deduplication</h3> + +<p>After upgrading Aurora to 0.6.0, enable deduplication with the <code>-deduplicate_snapshots</code> flag. +After the first snapshot the cluster will be using the deduplicated format to write to the +replicated log. Snapshots are created periodically by the scheduler (according to +the <code>-dlog_snapshot_interval</code> flag). 
An administrator can also force a snapshot operation with +<code>aurora_admin snapshot</code>.</p> + +<h3 id="disabling-deduplication">Disabling Deduplication</h3> + +<p>To disable deduplication, for example to rollback to Aurora, restart all of the cluster’s +schedulers with <code>-deduplicate_snapshots=false</code> and either wait for a snapshot or force one +using <code>aurora_admin snapshot</code>.</p> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. 
Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html> Added: aurora/site/publish/documentation/0.11.0/security/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/security/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/security/index.html (added) +++ aurora/site/publish/documentation/0.11.0/security/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,372 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/security/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<p>Aurora integrates with <a href="http://shiro.apache.org/">Apache Shiro</a> to provide security +controls for its API. 
In addition to providing some useful features out of the box, Shiro +also allows Aurora cluster administrators to adapt the security system to their organization’s +existing infrastructure.</p> + +<ul> +<li><a href="#enabling-security">Enabling Security</a></li> +<li><a href="#authentication">Authentication</a> + +<ul> +<li><a href="#http-basic-authentication">HTTP Basic Authentication</a> + +<ul> +<li><a href="#server-configuration">Server Configuration</a></li> +<li><a href="#client-configuration">Client Configuration</a></li> +</ul></li> +<li><a href="#http-spnego-authentication-kerberos">HTTP SPNEGO Authentication (Kerberos)</a> + +<ul> +<li><a href="#server-configuration-1">Server Configuration</a></li> +<li><a href="#client-configuration-1">Client Configuration</a></li> +</ul></li> +</ul></li> +<li><a href="#authorization">Authorization</a> + +<ul> +<li><a href="#using-an-ini-file-to-define-security-controls">Using an INI file to define security controls</a> + +<ul> +<li><a href="#caveats">Caveats</a></li> +</ul></li> +</ul></li> +<li><a href="#implementing-a-custom-realm">Implementing a Custom Realm</a> + +<ul> +<li><a href="#packaging-a-realm-module">Packaging a realm module</a></li> +</ul></li> +<li><a href="#known-issues">Known Issues</a></li> +</ul> + +<h1 id="enabling-security">Enabling Security</h1> + +<p>There are two major components of security: +<a href="http://en.wikipedia.org/wiki/Authentication#Authorization">authentication and authorization</a>. A +cluster administrator may choose the approach used for each, and may also implement custom +mechanisms for either. Later sections describe the options available.</p> + +<h1 id="authentication">Authentication</h1> + +<p>The scheduler must be configured with instructions for how to process authentication +credentials at a minimum. 
There are currently two built-in authentication schemes: +<a href="http://en.wikipedia.org/wiki/Basic_access_authentication">HTTP Basic Authentication</a> and +<a href="http://en.wikipedia.org/wiki/SPNEGO">SPNEGO</a> (Kerberos).</p> + +<h2 id="http-basic-authentication">HTTP Basic Authentication</h2> + +<p>Basic Authentication is a very quick way to add <em>some</em> security. It is supported +by all major browsers and HTTP client libraries with minimal work. However, +before relying on Basic Authentication you should be aware of the <a href="http://tools.ietf.org/html/rfc2617#section-4">security +considerations</a>.</p> + +<h3 id="server-configuration">Server Configuration</h3> + +<p>At a minimum you need to set 3 command-line flags on the scheduler:</p> +<pre class="highlight plaintext"><code>-http_authentication_mechanism=BASIC +-shiro_realm_modules=INI_AUTHNZ +-shiro_ini_path=path/to/security.ini +</code></pre> + +<p>And create a security.ini file like so:</p> +<pre class="highlight plaintext"><code>[users] +sally = apple, admin + +[roles] +admin = * +</code></pre> + +<p>The details of the security.ini file are explained below. Note that this file contains plaintext, +unhashed passwords.</p> + +<h3 id="client-configuration">Client Configuration</h3> + +<p>To configure the client for HTTP Basic authentication, add an entry to ~/.netrc with your credentials:</p> +<pre class="highlight plaintext"><code>% cat ~/.netrc +# ... + +machine aurora.example.com +login sally +password apple + +# ... 
+</code></pre> + +<p>No changes are required to <code>clusters.json</code>.</p> + +<h2 id="http-spnego-authentication-kerberos">HTTP SPNEGO Authentication (Kerberos)</h2> + +<h3 id="server-configuration">Server Configuration</h3> + +<p>At a minimum you need to set 5 command-line flags on the scheduler:</p> +<pre class="highlight plaintext"><code>-http_authentication_mechanism=NEGOTIATE +-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ +-kerberos_server_principal=HTTP/[email protected] +-kerberos_server_keytab=path/to/aurora.example.com.keytab +-shiro_ini_path=path/to/security.ini +</code></pre> + +<p>And create a security.ini file like so:</p> +<pre class="highlight plaintext"><code>% cat path/to/security.ini +[users] +sally = _, admin + +[roles] +admin = * +</code></pre> + +<p>What’s going on here? First, Aurora must be configured to request Kerberos credentials when presented with an +unauthenticated request. This is achieved by setting:</p> +<pre class="highlight plaintext"><code>-http_authentication_mechanism=NEGOTIATE +</code></pre> + +<p>Next, a Realm module must be configured to <strong>authenticate</strong> the current request using the Kerberos +credentials that were requested. Aurora ships with a realm module that can do this:</p> +<pre class="highlight plaintext"><code>-shiro_realm_modules=KERBEROS5_AUTHN[,...] +</code></pre> + +<p>The Kerberos5Realm requires a keytab file and a server principal name. The principal name will usually +be in the form <code>HTTP/[email protected]</code>.</p> +<pre class="highlight plaintext"><code>-kerberos_server_principal=HTTP/[email protected] +-kerberos_server_keytab=path/to/aurora.example.com.keytab +</code></pre> + +<p>The Kerberos5 realm module is authentication-only. For scheduler security to work you must also +enable a realm module that provides an Authorizer implementation. For example, to do this using the 
For example, to do this using the +IniShiroRealmModule:</p> +<pre class="highlight plaintext"><code>-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ +</code></pre> + +<p>You can then configure authorization using a security.ini file as described below +(the password field is ignored). You must configure the realm module with the path to this file:</p> +<pre class="highlight plaintext"><code>-shiro_ini_path=path/to/security.ini +</code></pre> + +<h3 id="client-configuration">Client Configuration</h3> + +<p>To use Kerberos on the client-side you must build Kerberos-enabled client binaries. Do this with</p> +<pre class="highlight plaintext"><code>./pants binary src/main/python/apache/aurora/kerberos:kaurora +./pants binary src/main/python/apache/aurora/kerberos:kaurora_admin +</code></pre> + +<p>You must also configure each cluster where you’ve enabled Kerberos on the scheduler +to use Kerberos authentication. Do this by setting <code>auth_mechanism</code> to <code>KERBEROS</code> +in <code>clusters.json</code>.</p> +<pre class="highlight plaintext"><code>% cat ~/.aurora/clusters.json +{ + "devcluser": { + "auth_mechanism": "KERBEROS", + ... + }, + ... +} +</code></pre> + +<h1 id="authorization">Authorization</h1> + +<p>Given a means to authenticate the entity a client claims they are, we need to define what privileges they have.</p> + +<h2 id="using-an-ini-file-to-define-security-controls">Using an INI file to define security controls</h2> + +<p>The simplest security configuration for Aurora is an INI file on the scheduler. For small +clusters, or clusters where the users and access controls change relatively infrequently, this is +likely the preferred approach. 
However, you may want to avoid this approach if access permissions +are rapidly changing, or if your access control information already exists in another system.</p> + +<p>You can enable INI-based configuration with the following scheduler command line arguments:</p> +<pre class="highlight plaintext"><code>-http_authentication_mechanism=BASIC +-shiro_ini_path=path/to/security.ini +</code></pre> + +<p><em>Note:</em> As the argument name reveals, this uses Shiro’s +<a href="http://shiro.apache.org/configuration.html#Configuration-INIConfiguration">IniRealm</a> behind +the scenes.</p> + +<p>The INI file will contain two sections: users and roles. Here’s an example of what might +be in security.ini:</p> +<pre class="highlight plaintext"><code>[users] +sally = apple, admin +jim = 123456, accounting +becky = letmein, webapp +larry = 654321,accounting +steve = password + +[roles] +admin = * +accounting = thrift.AuroraAdmin:setQuota +webapp = thrift.AuroraSchedulerManager:*:webapp +</code></pre> + +<p>The users section defines user credentials and the role(s) they are members of. These lines +are of the format <code><user> = <password>[, <role>...]</code>. As you probably noticed, the passwords are +in plaintext and as a result read access to this file should be restricted.</p> + +<p>In this configuration, each user has different privileges for actions in the cluster because +of the roles they are a part of:</p> + +<ul> +<li>admin is granted all privileges</li> +<li>accounting may adjust the amount of resource quota for any role</li> +<li>webapp represents a collection of jobs that represents a service, and its members may create and modify any jobs owned by it</li> +</ul> + +<h3 id="caveats">Caveats</h3> + +<p>You might find documentation on the Internet suggesting there are additional sections in <code>shiro.ini</code>, +like <code>[main]</code> and <code>[urls]</code>. 
These are not supported by Aurora as it uses a different mechanism to configure +those parts of Shiro. Think of Aurora’s <code>security.ini</code> as a subset with only <code>[users]</code> and <code>[roles]</code> sections.</p> + +<h1 id="implementing-a-custom-realm">Implementing a Custom Realm</h1> + +<p>Since Aurora’s security is backed by <a href="https://shiro.apache.org">Apache Shiro</a>, you can implement a +custom <a href="http://shiro.apache.org/realm.html">Realm</a> to define organization-specific security behavior.</p> + +<p>In addition to using Shiro’s standard APIs to implement a Realm you can link against Aurora to +access the type-safe Permissions Aurora uses. See the Javadoc for <code>org.apache.aurora.scheduler.spi</code> +for more information.</p> + +<h2 id="packaging-a-realm-module">Packaging a realm module</h2> + +<p>Package your custom Realm(s) with a Guice module that exposes a <code>Set<Realm></code> multibinding.</p> +<pre class="highlight java"><code><span style="color: #000000;font-weight: bold">package</span> <span style="background-color: #f8f8f8">com</span><span style="color: #000000;font-weight: bold">.</span><span style="color: #008080">example</span><span style="color: #000000;font-weight: bold">;</span> + +<span style="color: #000000;font-weight: bold">import</span> <span style="color: #555555">com.google.inject.AbstractModule</span><span style="color: #000000;font-weight: bold">;</span> +<span style="color: #000000;font-weight: bold">import</span> <span style="color: #555555">com.google.inject.multibindings.Multibinder</span><span style="color: #000000;font-weight: bold">;</span> +<span style="color: #000000;font-weight: bold">import</span> <span style="color: #555555">org.apache.shiro.realm.Realm</span><span style="color: #000000;font-weight: bold">;</span> + +<span style="color: #000000;font-weight: bold">public</span> <span style="color: #000000;font-weight: bold">class</span> <span style="color: #445588;font-weight: 
bold">MyRealmModule</span> <span style="color: #000000;font-weight: bold">extends</span> <span style="background-color: #f8f8f8">AbstractModule</span> <span style="color: #000000;font-weight: bold">{</span> + <span style="color: #3c5d5d;font-weight: bold">@Override</span> + <span style="color: #000000;font-weight: bold">public</span> <span style="color: #445588;font-weight: bold">void</span> <span style="background-color: #f8f8f8">configure</span><span style="color: #000000;font-weight: bold">()</span> <span style="color: #000000;font-weight: bold">{</span> + <span style="background-color: #f8f8f8">Realm</span> <span style="background-color: #f8f8f8">myRealm</span> <span style="color: #000000;font-weight: bold">=</span> <span style="color: #000000;font-weight: bold">new</span> <span style="background-color: #f8f8f8">MyRealm</span><span style="color: #000000;font-weight: bold">();</span> + + <span style="background-color: #f8f8f8">Multibinder</span><span style="color: #000000;font-weight: bold">.</span><span style="color: #008080">newSetBinder</span><span style="color: #000000;font-weight: bold">(</span><span style="background-color: #f8f8f8">binder</span><span style="color: #000000;font-weight: bold">(),</span> <span style="background-color: #f8f8f8">Realm</span><span style="color: #000000;font-weight: bold">.</span><span style="color: #008080">class</span><span style="color: #000000;font-weight: bold">).</span><span style="color: #008080">addBinding</span><span style="color: #000000;font-weight: bold">().</span><span style="color: #008080">toInstance</span><span style="color: #000000;font-weight: bold">(</span><span style="background-color: #f8f8f8">myRealm</span><span style="color: #000000;font-weight: bold">);</span> + <span style="color: #000000;font-weight: bold">}</span> + + <span style="color: #000000;font-weight: bold">static</span> <span style="color: #000000;font-weight: bold">class</span> <span style="color: #445588;font-weight: bold">MyRealm</span> 
<span style="color: #000000;font-weight: bold">implements</span> <span style="background-color: #f8f8f8">Realm</span> <span style="color: #000000;font-weight: bold">{</span> + <span style="color: #999988;font-style: italic">// Realm implementation.</span> + <span style="color: #000000;font-weight: bold">}</span> +<span style="color: #000000;font-weight: bold">}</span> +</code></pre> + +<p>To use your module in the scheduler, include it as a realm module based on its fully-qualified +class name:</p> +<pre class="highlight plaintext"><code>-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ,com.example.MyRealmModule +</code></pre> + +<h1 id="known-issues">Known Issues</h1> + +<p>While the APIs and SPIs we ship with are stable as of 0.8.0, we are aware of several incremental +improvements. Please follow, vote, or send patches.</p> + +<p>Relevant tickets: +* <a href="https://issues.apache.org/jira/browse/AURORA-343">AURORA-343</a>: HTTPS support +* <a href="https://issues.apache.org/jira/browse/AURORA-1248">AURORA-1248</a>: Client retries 4xx errors +* <a href="https://issues.apache.org/jira/browse/AURORA-1279">AURORA-1279</a>: Remove kerberos-specific build targets +* <a href="https://issues.apache.org/jira/browse/AURORA-1291">AURORA-1293</a>: Consider defining a JSON format in place of INI +* <a href="https://issues.apache.org/jira/browse/AURORA-1179">AURORA-1179</a>: Support hashed passwords in security.ini +* <a href="https://issues.apache.org/jira/browse/AURORA-1295">AURORA-1295</a>: Support security for the ReadOnlyScheduler service</p> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How 
To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html> Added: aurora/site/publish/documentation/0.11.0/sla/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/sla/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/sla/index.html (added) +++ aurora/site/publish/documentation/0.11.0/sla/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,302 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga 
= document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/sla/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<h2 id="aurora-sla-measurement">Aurora SLA Measurement</h2> + +<ul> +<li><a href="#overview">Overview</a></li> +<li><a href="#metric-details">Metric Details</a> + +<ul> +<li><a href="#platform-uptime">Platform Uptime</a></li> +<li><a href="#job-uptime">Job Uptime</a></li> +<li><a href="#median-time-to-assigned-(mtta)">Median Time To Assigned (MTTA)</a></li> +<li><a href="#median-time-to-running-(mttr)">Median Time To Running (MTTR)</a></li> 
+</ul></li> +<li><a href="#limitations">Limitations</a></li> +</ul> + +<h2 id="overview">Overview</h2> + +<p>The primary goal of this feature is to collect and monitor Aurora job SLA (Service Level +Agreement) metrics that define a contractual relationship between the Aurora/Mesos platform +and hosted services.</p> + +<p>The Aurora SLA feature is by default only enabled for service (non-cron) +production jobs (<code>"production = True"</code> in your <code>.aurora</code> config). It can be enabled for +non-production services via the scheduler command line flag <code>-sla_non_prod_metrics</code>.</p> + +<p>Counters that track SLA measurements are computed periodically within the scheduler. +The individual instance metrics are refreshed every minute (configurable via +<code>sla_stat_refresh_interval</code>). The instance counters are subsequently aggregated by +relevant grouping types before being exported to the scheduler <code>/vars</code> endpoint (when using <code>vagrant</code> +that would be <code>http://192.168.33.7:8081/vars</code>).</p> + +<h2 id="metric-details">Metric Details</h2> + +<h3 id="platform-uptime">Platform Uptime</h3> + +<p><em>Aggregate amount of time a job spends in a non-runnable state due to platform unavailability +or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects any +system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task kills/restarts +will not degrade this metric.</em></p> + +<p><strong>Collection scope:</strong></p> + +<ul> +<li>Per job - <code>sla_<job_key>_platform_uptime_percent</code></li> +<li>Per cluster - <code>sla_cluster_platform_uptime_percent</code></li> +</ul> + +<p><strong>Units:</strong> percent</p> + +<p>A fault in the task environment may cause Aurora and Mesos to have different views on the task state +or lose track of the task existence. In such cases, the service task is marked as LOST and +rescheduled by Aurora. 
For example, this may happen when the task stays in ASSIGNED or STARTING +for too long or the Mesos slave becomes unhealthy (or disappears completely). The time between +task entering LOST and its replacement reaching RUNNING state is counted towards platform downtime.</p> + +<p>Another example of a platform downtime event is administrator-requested task rescheduling. This +happens during planned Mesos slave maintenance when all slave tasks are marked as DRAINED and +rescheduled elsewhere.</p> + +<p>To accurately calculate Platform Uptime, we must separate platform-incurred downtime from user +actions that put a service instance in a non-operational state. It is simpler to isolate +user-incurred downtime and treat all other downtime as platform-incurred.</p> + +<p>Currently, a user can cause a healthy service (task) downtime in only two ways: via the <code>killTasks</code> +or <code>restartShards</code> RPCs. For both, the affected tasks leave an audit state transition trail +relevant to uptime calculations. By applying a special “SLA meaning” to exposed task state +transition records, we can build a deterministic downtime trace for every given service instance.</p> + +<p>A task going through a state transition carries one of three possible SLA meanings +(see <a href="https://github.com/apache/aurora/blob/#{git_tag}/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java">SlaAlgorithm.java</a> for the +SLA-to-task-state mapping):</p> + +<ul> +<li><p>Task is UP: starts a period where the task is considered to be up and running from the Aurora +platform standpoint.</p></li> +<li><p>Task is DOWN: starts a period where the task cannot reach the UP state for some +non-user-related reason. Counts towards instance downtime.</p></li> +<li><p>Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to +user-initiated action or failure. 
We ignore this period for uptime calculation purposes.</p></li> +</ul> + +<p>This metric is recalculated over the last sampling period (last minute) to account for +any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the +sampling interval as well as adjacent REMOVED events.</p> + +<h3 id="job-uptime">Job Uptime</h3> + +<p><em>Percentage of the job instances considered to be in RUNNING state for the specified duration +relative to request time. This is a purely application-side metric that considers the aggregate +uptime of all RUNNING instances. Any user- or platform-initiated restarts directly affect +this metric.</em></p> + +<p><strong>Collection scope:</strong> We currently expose job uptime values at 5 pre-defined +percentiles (50th, 75th, 90th, 95th and 99th):</p> + +<ul> +<li><code>sla_<job_key>_job_uptime_50_00_sec</code></li> +<li><code>sla_<job_key>_job_uptime_75_00_sec</code></li> +<li><code>sla_<job_key>_job_uptime_90_00_sec</code></li> +<li><code>sla_<job_key>_job_uptime_95_00_sec</code></li> +<li><code>sla_<job_key>_job_uptime_99_00_sec</code></li> +</ul> + +<p><strong>Units:</strong> seconds</p> + +<p>You can also get customized real-time stats from the Aurora client. See <code>aurora sla -h</code> for +more details.</p> + +<h3 id="median-time-to-assigned-mtta">Median Time To Assigned (MTTA)</h3> + +<p><em>Median time a job spends waiting for its tasks to be assigned to a host. This is a combined +metric that helps track the dependency of scheduling performance on the requested resources +(user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform scope).</em></p> + +<p><strong>Collection scope:</strong></p> + +<ul> +<li>Per job - <code>sla_<job_key>_mtta_ms</code></li> +<li>Per cluster - <code>sla_cluster_mtta_ms</code></li> +<li>Per instance size (small, medium, large, x-large, xx-large).
Sizes are defined in +<a href="https://github.com/apache/aurora/blob/#{git_tag}/src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java">ResourceAggregates.java</a>. + +<ul> +<li>By CPU:</li> +<li><code>sla_cpu_small_mtta_ms</code></li> +<li><code>sla_cpu_medium_mtta_ms</code></li> +<li><code>sla_cpu_large_mtta_ms</code></li> +<li><code>sla_cpu_xlarge_mtta_ms</code></li> +<li><code>sla_cpu_xxlarge_mtta_ms</code></li> +<li>By RAM:</li> +<li><code>sla_ram_small_mtta_ms</code></li> +<li><code>sla_ram_medium_mtta_ms</code></li> +<li><code>sla_ram_large_mtta_ms</code></li> +<li><code>sla_ram_xlarge_mtta_ms</code></li> +<li><code>sla_ram_xxlarge_mtta_ms</code></li> +<li>By DISK:</li> +<li><code>sla_disk_small_mtta_ms</code></li> +<li><code>sla_disk_medium_mtta_ms</code></li> +<li><code>sla_disk_large_mtta_ms</code></li> +<li><code>sla_disk_xlarge_mtta_ms</code></li> +<li><code>sla_disk_xxlarge_mtta_ms</code></li> +</ul></li> +</ul> + +<p><strong>Units:</strong> milliseconds</p> + +<p>MTTA only considers instances that have already reached ASSIGNED state and ignores those +that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource +constraints) do not affect metric curves.</p> + +<h3 id="median-time-to-running-mttr">Median Time To Running (MTTR)</h3> + +<p><em>Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric +reflecting the overall time it takes for Aurora/Mesos to start executing user content.</em></p> + +<p><strong>Collection scope:</strong></p> + +<ul> +<li>Per job - <code>sla_<job_key>_mttr_ms</code></li> +<li>Per cluster - <code>sla_cluster_mttr_ms</code></li> +<li>Per instance size (small, medium, large, x-large, xx-large).
Sizes are defined in +<a href="https://github.com/apache/aurora/blob/#{git_tag}/src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java">ResourceAggregates.java</a>. + +<ul> +<li>By CPU:</li> +<li><code>sla_cpu_small_mttr_ms</code></li> +<li><code>sla_cpu_medium_mttr_ms</code></li> +<li><code>sla_cpu_large_mttr_ms</code></li> +<li><code>sla_cpu_xlarge_mttr_ms</code></li> +<li><code>sla_cpu_xxlarge_mttr_ms</code></li> +<li>By RAM:</li> +<li><code>sla_ram_small_mttr_ms</code></li> +<li><code>sla_ram_medium_mttr_ms</code></li> +<li><code>sla_ram_large_mttr_ms</code></li> +<li><code>sla_ram_xlarge_mttr_ms</code></li> +<li><code>sla_ram_xxlarge_mttr_ms</code></li> +<li>By DISK:</li> +<li><code>sla_disk_small_mttr_ms</code></li> +<li><code>sla_disk_medium_mttr_ms</code></li> +<li><code>sla_disk_large_mttr_ms</code></li> +<li><code>sla_disk_xlarge_mttr_ms</code></li> +<li><code>sla_disk_xxlarge_mttr_ms</code></li> +</ul></li> +</ul> + +<p><strong>Units:</strong> milliseconds</p> + +<p>MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with +unreasonable resource constraints) do not affect metric curves.</p> + +<h2 id="limitations">Limitations</h2> + +<ul> +<li><p>The availability of Aurora SLA metrics is bound by the scheduler availability.</p></li> +<li><p>All metrics are calculated at a pre-defined interval (currently set at 1 minute).
+Scheduler restarts may result in missed collections.</p></li> +</ul> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. 
Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html> Added: aurora/site/publish/documentation/0.11.0/storage-config/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.11.0/storage-config/index.html?rev=1721584&view=auto ============================================================================== --- aurora/site/publish/documentation/0.11.0/storage-config/index.html (added) +++ aurora/site/publish/documentation/0.11.0/storage-config/index.html Wed Dec 23 22:45:21 2015 @@ -0,0 +1,270 @@ +<!DOCTYPE html> +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <title>Apache Aurora</title> + <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.1/css/bootstrap.min.css"> + <link href="/assets/css/main.css" rel="stylesheet"> + <!-- Analytics --> + <script type="text/javascript"> + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-45879646-1']); + _gaq.push(['_setDomainName', 'apache.org']); + _gaq.push(['_trackPageview']); + + (function() { + var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; + ga.src = ('https:' == document.location.protocol ? 
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; + var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + </script> + </head> + <body> + <div class="container-fluid section-header"> + <div class="container"> + <div class="nav nav-bar"> + <a href="/"><img src="/assets/img/aurora_logo_dkbkg.svg" width="300" alt="Transparent Apache Aurora logo with dark background"/></a> + <ul class="nav navbar-nav navbar-right"> + <li><a href="/documentation/latest/">Documentation</a></li> + <li><a href="/community/">Community</a></li> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/blog/">Blog</a></li> + </ul> + </div> + </div> +</div> + + <div class="container-fluid"> + <div class="container content"> + <div class="col-md-12 documentation"> +<h5 class="page-header text-uppercase">Documentation +<select onChange="window.location.href='/documentation/' + this.value + '/storage-config/'" + value="0.11.0"> + <option value="0.11.0" + selected="selected"> + 0.11.0 + (latest) + </option> + <option value="0.10.0" + > + 0.10.0 + </option> + <option value="0.9.0" + > + 0.9.0 + </option> + <option value="0.8.0" + > + 0.8.0 + </option> + <option value="0.7.0-incubating" + > + 0.7.0-incubating + </option> + <option value="0.6.0-incubating" + > + 0.6.0-incubating + </option> + <option value="0.5.0-incubating" + > + 0.5.0-incubating + </option> +</select> +</h5> +<h1 id="storage-configuration-and-maintenance">Storage Configuration And Maintenance</h1> + +<ul> +<li><a href="#overview">Overview</a></li> +<li><a href="#scheduler-storage-configuration-flags">Scheduler storage configuration flags</a> + +<ul> +<li><a href="#mesos-replicated-log-configuration-flags">Mesos replicated log configuration flags</a></li> +<li><a href="#-native_log_quorum_size">-native_log_quorum_size</a></li> +<li><a href="#-native_log_file_path">-native_log_file_path</a></li> +<li><a 
href="#-native_log_zk_group_path">-native_log_zk_group_path</a></li> +<li><a href="#backup-configuration-flags">Backup configuration flags</a></li> +<li><a href="#-backup_interval">-backup_interval</a></li> +<li><a href="#-backup_dir">-backup_dir</a></li> +<li><a href="#-max_saved_backups">-max_saved_backups</a></li> +</ul></li> +<li><a href="#recovering-from-a-scheduler-backup">Recovering from a scheduler backup</a> + +<ul> +<li><a href="#summary">Summary</a></li> +<li><a href="#preparation">Preparation</a></li> +<li><a href="#cleanup-and-re-initialize-mesos-replicated-log">Cleanup and re-initialize Mesos replicated log</a></li> +<li><a href="#restore-from-backup">Restore from backup</a></li> +<li><a href="#cleanup">Cleanup</a></li> +</ul></li> +</ul> + +<h2 id="overview">Overview</h2> + +<p>This document summarizes Aurora storage configuration and maintenance details and is +intended for use by anyone deploying and/or maintaining Aurora.</p> + +<p>For a high-level overview of the Aurora storage architecture, refer to <a href="/documentation/0.11.0/storage/">this document</a>.</p> + +<h2 id="scheduler-storage-configuration-flags">Scheduler storage configuration flags</h2> + +<p>Below is a summary of scheduler storage configuration flags that either don’t have default values +or require attention before deploying in a production environment.</p> + +<h3 id="mesos-replicated-log-configuration-flags">Mesos replicated log configuration flags</h3> + +<h4 id="nativelogquorum_size">-native_log_quorum_size</h4> + +<p>Defines the Mesos replicated log quorum size. See +<a href="/documentation/0.11.0/deploying-aurora-scheduler/#replicated-log-configuration">the replicated log configuration document</a> +on how to choose the right value.</p> + +<h4 id="nativelogfile_path">-native_log_file_path</h4> + +<p>Location of the Mesos replicated log files.
Consider allocating a dedicated disk (preferably SSD) +for Mesos replicated log files to ensure optimal storage performance.</p> + +<h4 id="nativelogzkgrouppath">-native_log_zk_group_path</h4> + +<p>ZooKeeper path used for Mesos replicated log quorum discovery.</p> + +<p>See the <a href="https://github.com/apache/aurora/blob/#{git_tag}/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java">code</a> for +other available Mesos replicated log configuration options and default values.</p> + +<h3 id="backup-configuration-flags">Backup configuration flags</h3> + +<p>Configuration options for the Aurora scheduler backup manager.</p> + +<h4 id="backup_interval">-backup_interval</h4> + +<p>The interval at which the scheduler writes local storage backups. The default is every hour.</p> + +<h4 id="backup_dir">-backup_dir</h4> + +<p>Directory to write backups to.</p> + +<h4 id="maxsavedbackups">-max_saved_backups</h4> + +<p>Maximum number of backups to retain before deleting the oldest backup(s).</p> + +<h2 id="recovering-from-a-scheduler-backup">Recovering from a scheduler backup</h2> + +<ul> +<li><a href="#summary">Summary</a></li> +<li><a href="#preparation">Preparation</a></li> +<li><a href="#cleanup-and-re-initialize-mesos-replicated-log">Cleanup and re-initialize Mesos replicated log</a></li> +<li><a href="#restore-from-backup">Restore from backup</a></li> +<li><a href="#cleanup">Cleanup</a></li> +</ul> + +<p><strong>Be sure to read the entire page before attempting to restore from a backup, as restoring may have +unintended consequences.</strong></p> + +<h3 id="summary">Summary</h3> + +<p>The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an +earlier, backed-up version and requires all schedulers to be taken down temporarily while +restoring. Once the restoration completes, the scheduler state resets to what it was when the backup was created.
+This means any jobs/tasks created or updated after the backup are unknown to the scheduler and will +be killed shortly after the cluster restarts. All other tasks continue operating as normal.</p> + +<p>Usually, it is a bad idea to restore a backup that is not extremely recent (i.e., one that is older than a few +hours). This is because the scheduler will expect the cluster to look exactly as the backup does, +so any tasks that have been rescheduled since the backup was taken will be killed.</p> + +<h3 id="preparation">Preparation</h3> + +<p>Follow these steps to prepare the cluster for restoring from a backup:</p> + +<ul> +<li><p>Stop all scheduler instances</p></li> +<li><p>Consider blocking external traffic on the port defined by <code>-http_port</code> for all schedulers to +prevent users from interacting with the scheduler during the restoration process. This helps +troubleshooting by reducing scheduler log noise and prevents users from making changes that would +be erased after the backup snapshot is restored.</p></li> +<li><p>The next steps put the scheduler into a partially disabled state in which it can still +accept storage recovery requests but cannot schedule tasks or change task states. This may be +accomplished by updating the following scheduler configuration options:</p> + +<ul> +<li>Set <code>-mesos_master_address</code> to a non-existent ZooKeeper address. This will prevent the scheduler from +registering with Mesos. E.g.: <code>-mesos_master_address=zk://localhost:2181</code></li> +<li><code>-max_registration_delay</code> - set to a sufficiently long interval to prevent a registration timeout +and, as a result, scheduler suicide. E.g.: <code>-max_registration_delay=360mins</code></li> +<li>Make sure the <code>-reconciliation_initial_delay</code> option is set high enough (e.g.: <code>365days</code>) to +prevent accidental task GC.
This is important because the scheduler will attempt to reconcile the cluster +state and will kill all tasks when restarted with an empty Mesos replicated log.</li> +</ul></li> +<li><p>Restart all schedulers</p></li> +</ul> + +<h3 id="cleanup-and-re-initialize-mesos-replicated-log">Cleanup and re-initialize Mesos replicated log</h3> + +<p>Get rid of the corrupted files and re-initialize the Mesos replicated log:</p> + +<ul> +<li>Stop schedulers</li> +<li>Delete all files under <code>-native_log_file_path</code> on all schedulers</li> +<li>Initialize the Mesos replica’s log file: <code>mesos-log initialize --path=<-native_log_file_path></code></li> +<li>Restart schedulers</li> +</ul> + +<h3 id="restore-from-backup">Restore from backup</h3> + +<p>At this point the scheduler is ready to rehydrate from the backup:</p> + +<ul> +<li><p>Identify the leading scheduler by:</p> + +<ul> +<li>running <code>aurora_admin get_scheduler <cluster></code> - if the scheduler is responsive</li> +<li>examining scheduler logs</li> +<li>or examining the ZooKeeper registration under the path defined by <code>-zk_endpoints</code> +and <code>-serverset_path</code></li> +</ul></li> +<li><p>Locate the desired backup file, copy it to the leading scheduler and stage recovery by running +the following command on the leader: +<code>aurora_admin scheduler_stage_recovery <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm></code></p></li> +<li><p>At this point, the recovery snapshot is staged and available for manual verification/modification +via the <code>aurora_admin scheduler_print_recovery_tasks</code> and <code>scheduler_delete_recovery_tasks</code> commands. +See <code>aurora_admin help <command></code> for usage details.</p></li> +<li><p>Commit recovery.
This instructs the scheduler to overwrite the existing Mesos replicated log with +the provided backup snapshot and initiate a mandatory failover: +<code>aurora_admin scheduler_commit_recovery <cluster></code></p></li> +</ul> + +<h3 id="cleanup">Cleanup</h3> + +<p>Undo any modifications made during the <a href="#preparation">Preparation</a> sequence.</p> + +</div> + + </div> + </div> + <div class="container-fluid section-footer buffer"> + <div class="container"> + <div class="row"> + <div class="col-md-2 col-md-offset-1"><h3>Quick Links</h3> + <ul> + <li><a href="/downloads/">Downloads</a></li> + <li><a href="/community/">Mailing Lists</a></li> + <li><a href="http://issues.apache.org/jira/browse/AURORA">Issue Tracking</a></li> + <li><a href="/documentation/latest/contributing/">How To Contribute</a></li> + </ul> + </div> + <div class="col-md-2"><h3>The ASF</h3> + <ul> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </div> + <div class="col-md-6"> + <p class="disclaimer">Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. The <a href="https://www.flickr.com/photos/trondk/12706051375/">Aurora Borealis IX photo</a> displayed on the homepage is available under a <a href="https://creativecommons.org/licenses/by-nc-nd/2.0/">Creative Commons BY-NC-ND 2.0 license</a>. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.</p> + </div> + </div> + </div> + + </body> +</html>
