Author: buildbot
Date: Mon May 11 19:43:11 2015
New Revision: 950969
Log:
Staging update by buildbot for slider
Modified:
websites/staging/slider/trunk/content/ (props changed)
websites/staging/slider/trunk/content/design/rolehistory.html
Propchange: websites/staging/slider/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon May 11 19:43:11 2015
@@ -1 +1 @@
-1678460
+1678807
Modified: websites/staging/slider/trunk/content/design/rolehistory.html
==============================================================================
--- websites/staging/slider/trunk/content/design/rolehistory.html (original)
+++ websites/staging/slider/trunk/content/design/rolehistory.html Mon May 11
19:43:11 2015
@@ -191,11 +191,26 @@ Latest release: <strong>0.70.1-incubatin
that have reached their escalation timeout and yet have not been
satisfied.</li>
<li>Such requests are cancelled and "relaxed" requests re-issued.</li>
<li>Labels are always respected; even relaxed requests use any labels
specified in <code>resources.json</code></li>
-<li>If a node is considered unreliable (as per-the slider 0.70 changes), it is
not used in the initial
+<li>If a node is considered unreliable (as per the slider-0.70-incubating
changes), it is not used in the initial
request. YARN may still allocate relaxed instances on such nodes. That is:
there is no explicit
blacklisting, merely deliberate exclusion of unreliable nodes from explicitly
placed requests.</li>
+<li>Node and component failure counts are reset on a regular schedule. The
"recently failed"
+counters are the ones used to decide if a node is unreliable or a component
has failed too
+many times. Long-lived applications can therefore tolerate a low rate of
component failures.</li>
+<li>The notion of "failed" differentiates between application failures, node
failures and
+pre-emption.<ul>
+<li>YARN container pre-emption is not considered a failure.</li>
+<li>Node failures are: anything reported as such by YARN, and any unexpected
application exit
+(as these may be caused by node-related issues, such as a port conflict with other
applications).</li>
+<li>Application failures are: resource limits being exceeded (physical or virtual memory), and
unexpected application
+exit.</li>
+<li>Only "application failures" are added to the "failed recently" count,
and so only they are
+ used to decide whether a component has failed too many times for the
application
+ to be considered working.</li>
+</ul>
+</li>
</ol>
-<p>Role History Reloading Enhancements</p>
+<h5 id="role-history-reloading-enhancements">Role History Reloading
Enhancements</h5>
<p>The reloading of persisted role history has also been improved:
<a href="https://issues.apache.org/jira/browse/SLIDER-600">SLIDER-600</a></p>
<ol>
<li>Reloading of persistent history has been made resilient to changes in the
number of roles.</li>