Author: buildbot
Date: Mon May 11 19:43:11 2015
New Revision: 950969

Log:
Staging update by buildbot for slider

Modified:
    websites/staging/slider/trunk/content/   (props changed)
    websites/staging/slider/trunk/content/design/rolehistory.html

Propchange: websites/staging/slider/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon May 11 19:43:11 2015
@@ -1 +1 @@
-1678460
+1678807

Modified: websites/staging/slider/trunk/content/design/rolehistory.html
==============================================================================
--- websites/staging/slider/trunk/content/design/rolehistory.html (original)
+++ websites/staging/slider/trunk/content/design/rolehistory.html Mon May 11 
19:43:11 2015
@@ -191,11 +191,26 @@ Latest release: <strong>0.70.1-incubatin
 that have reached their escalation timeout and yet have not been 
satisfied.</li>
 <li>Such requests are cancelled and "relaxed" requests re-issued.</li>
 <li>Labels are always respected; even relaxed requests use any labels 
specified in <code>resources.json</code></li>
-<li>If a node is considered unreliable (as per-the slider 0.70 changes), it is 
not used in the initial
+<li>If a node is considered unreliable (as per the slider-0.70-incubating 
changes), it is not used in the initial
 request. YARN may still allocate relaxed instances on such nodes. That is: 
there is no explicit
 blacklisting, merely deliberate exclusion of unreliable nodes from explicitly 
placed requests.</li>
+<li>Node and component failure counts are reset on a regular schedule. The 
"recently failed"
+counters are the ones used to decide if a node is unreliable or a component 
has failed too 
+many times. Long-lived applications can therefore tolerate a low rate of 
component failures.</li>
+<li>The notion of "failed" differentiates between application failures, node 
failures and
+pre-emption.<ul>
+<li>YARN container pre-emption is not considered a failure.</li>
+<li>Node failures are: anything reported as such by YARN, and any unexpected 
application exit
+(as these may be caused by node-related issues; port conflict with other 
applications...etc)</li>
+<li>Application failures are resource limits being exceeded (RAM, VRAM), and 
unexpected application
+exit.</li>
+<li>Only "application failures" are added to the "failed recently" count, 
and so only they are 
+  used to decide whether a component has failed too many times for the 
application
+  to be considered working.</li>
+</ul>
+</li>
 </ol>
-<p>Role History Reloading Enhancements</p>
+<h5 id="role-history-reloading-enhancements">Role History Reloading 
Enhancements</h5>
 <p>How persisted role history is reloaded has also been improved 
<a href="https://issues.apache.org/jira/browse/SLIDER-600">SLIDER-600</a></p>
 <ol>
 <li>Reloading of persistent history has been made resilient to changes in the 
number of roles.</li>
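
The failure-accounting rules added in this revision can be sketched roughly as follows. This is an illustrative Python model, not Slider's implementation: the names `Outcome`, `FailureTracker`, `record`, `reset_recent` and the `threshold` parameter are all hypothetical. The text lists unexpected application exits under both node failures and application failures; counting them against both is an assumption made here.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    """Hypothetical reasons for a container ending."""
    PREEMPTED = auto()        # YARN container pre-emption
    NODE_FAILURE = auto()     # reported as a node failure by YARN
    LIMITS_EXCEEDED = auto()  # resource limits exceeded
    UNEXPECTED_EXIT = auto()  # unexpected application exit


@dataclass
class FailureTracker:
    """Tracks "recently failed" counts for one component (illustrative only)."""
    threshold: int = 3                  # hypothetical failure limit
    component_recent: int = 0           # recent application failures
    node_recent: dict = field(default_factory=dict)  # node -> recent failures

    def record(self, node, outcome):
        # Pre-emption is not considered a failure at all.
        if outcome is Outcome.PREEMPTED:
            return
        # Node failures feed the node-reliability decision.
        if outcome in (Outcome.NODE_FAILURE, Outcome.UNEXPECTED_EXIT):
            self.node_recent[node] = self.node_recent.get(node, 0) + 1
        # Only application failures count towards the component limit.
        if outcome in (Outcome.LIMITS_EXCEEDED, Outcome.UNEXPECTED_EXIT):
            self.component_recent += 1

    def reset_recent(self):
        # Called on a regular schedule, so a low failure rate is tolerated.
        self.component_recent = 0
        self.node_recent.clear()

    def component_failed(self):
        return self.component_recent > self.threshold

    def node_unreliable(self, node):
        return self.node_recent.get(node, 0) > self.threshold
```

Because `reset_recent` clears both counters, a long-lived application in this model is only marked as failed when failures cluster within one reset interval.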

