rolehistory.md

stevel Mon, 11 May 2015 12:43:57 -0700

Author: stevel
Date: Mon May 11 19:43:05 2015
New Revision: 1678807

URL: http://svn.apache.org/r1678807
Log:
SLIDER-856 slider needs to treat pre-emption events as not-a-real-failure


Modified:
    incubator/slider/site/trunk/content/design/rolehistory.md

Modified: incubator/slider/site/trunk/content/design/rolehistory.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/design/rolehistory.md?rev=1678807&r1=1678806&r2=1678807&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/design/rolehistory.md (original)
+++ incubator/slider/site/trunk/content/design/rolehistory.md Mon May 11 
19:43:05 2015
@@ -33,11 +33,25 @@ A major rework of placement has taken pl
 that have reached their escalation timeout and yet have not been satisfied.
 1. Such requests are cancelled and "relaxed" requests re-issued.
 1. Labels are always respected; even relaxed requests use any labels specified 
in `resources.json`
-1. If a node is considered unreliable (as per-the slider 0.70 changes), it is 
not used in the initial
+1. If a node is considered unreliable (as per-the slider-0.70-incubating 
changes), it is not used in the initial
 request. YARN may still allocate relaxed instances on such nodes. That is: 
there is no explicit
 blacklisting, merely deliberate exclusion of unreliable nodes from explicitly 
placed requests.
+1. Node and component failure counts are reset on a regular schedule. The 
"recently failed"
+counters are the ones used to decide if a node is unreliable or a component 
has failed too 
+many times. Long-lived applications can therefore tolerate a low rate of 
component failures.
+1. The notion of "failed" differentiates between application failures, node 
failures and
+pre-emption.
+    * YARN container pre-emption is not considered a failure.
+    * Node failures are: anything reported as such by YARN, and any unexpected 
application exit
+    (as these may be caused by node-related issues; port conflict with other 
applications...etc)
+    * Application failures are resource limits being exceeded (RAM, VRAM), and 
unexpected application
+    exit.
+    * Only "application failures" are added to the "failed recently" count,
and so only they are 
+      used to decide whether a component has a failed too many times for the 
application
+      to be considered working.
+  
 
-Role History Reloading Enhancements
+##### Role History Reloading Enhancements
 
 How persisted role history has also been improved 
[SLIDER-600]((https://issues.apache.org/jira/browse/SLIDER-600)
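
The failure-accounting rules added in this revision can be sketched as follows. This is an illustration only, not Slider's code: the `Exit` categories, the `FailureTracker` class, its thresholds, and the `>=` comparison are all hypothetical names and choices made for the example. It assumes that an unexpected application exit counts against both the node and the component, as the text above describes, and that a scheduler calls `reset_recent()` at whatever interval is configured.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum, auto


class Exit(Enum):
    """Hypothetical container exit categories, following the text of the diff."""
    PREEMPTED = auto()               # YARN container pre-emption
    NODE_REPORTED_FAILURE = auto()   # anything YARN reports as a node failure
    LIMITS_EXCEEDED = auto()         # resource limits exceeded
    UNEXPECTED_EXIT = auto()         # unexpected application exit


@dataclass
class FailureTracker:
    """Tracks "recently failed" counts per node and per component (assumed model)."""
    node_threshold: int
    component_threshold: int
    node_recent: Counter = field(default_factory=Counter)
    component_recent: Counter = field(default_factory=Counter)

    def record(self, component: str, node: str, outcome: Exit) -> None:
        # Pre-emption is not considered a failure: nothing is counted.
        if outcome is Exit.PREEMPTED:
            return
        # Node failures: YARN-reported failures and unexpected exits.
        if outcome in (Exit.NODE_REPORTED_FAILURE, Exit.UNEXPECTED_EXIT):
            self.node_recent[node] += 1
        # Application failures: exceeded limits and unexpected exits.
        # Only these feed the component's "failed recently" count.
        if outcome in (Exit.LIMITS_EXCEEDED, Exit.UNEXPECTED_EXIT):
            self.component_recent[component] += 1

    def reset_recent(self) -> None:
        # Called on a regular schedule so a low failure rate never accumulates.
        self.node_recent.clear()
        self.component_recent.clear()

    def node_is_unreliable(self, node: str) -> bool:
        return self.node_recent[node] >= self.node_threshold

    def component_has_failed(self, component: str) -> bool:
        return self.component_recent[component] >= self.component_threshold
```

Under these assumptions, a long-lived application that loses a container to pre-emption every few hours never approaches the component threshold, and a node with a burst of failures becomes eligible for explicit placement again after the next reset.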

svn commit: r1678807 - /incubator/slider/site/trunk/content/design/rolehistory.md
