Author: stevel
Date: Mon May 11 19:43:05 2015
New Revision: 1678807
URL: http://svn.apache.org/r1678807
Log:
SLIDER-856 slider needs to treat pre-emption events as not-a-real-failure
Modified:
incubator/slider/site/trunk/content/design/rolehistory.md
Modified: incubator/slider/site/trunk/content/design/rolehistory.md
URL:
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/design/rolehistory.md?rev=1678807&r1=1678806&r2=1678807&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/design/rolehistory.md (original)
+++ incubator/slider/site/trunk/content/design/rolehistory.md Mon May 11
19:43:05 2015
@@ -33,11 +33,25 @@ A major rework of placement has taken pl
that have reached their escalation timeout and yet have not been satisfied.
1. Such requests are cancelled and "relaxed" requests re-issued.
1. Labels are always respected; even relaxed requests use any labels specified
in `resources.json`
-1. If a node is considered unreliable (as per-the slider 0.70 changes), it is
not used in the initial
+1. If a node is considered unreliable (as per the slider-0.70-incubating
changes), it is not used in the initial
request. YARN may still allocate relaxed instances on such nodes. That is:
there is no explicit
blacklisting, merely deliberate exclusion of unreliable nodes from explicitly
placed requests.
+1. Node and component failure counts are reset on a regular schedule. The
"recently failed"
+counters are the ones used to decide if a node is unreliable or a component
has failed too
+many times. Long-lived applications can therefore tolerate a low rate of
component failures.
+1. The notion of "failed" differentiates between application failures, node
failures and
+pre-emption.
+ * YARN container pre-emption is not considered a failure.
+ * Node failures are: anything reported as such by YARN, and any unexpected
application exit
+ (as these may be caused by node-related issues, such as port conflicts with
other applications).
+ * Application failures are resource limits being exceeded (RAM, VRAM), and
unexpected application
+ exit.
+ * Only "application failures" are added to the "failed recently" count,
and so only they are
+ used to decide whether a component has failed too many times for the
application
+ to be considered working.
+
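The failure accounting described in the added list items could be sketched roughly as follows. This is an illustrative sketch only, not Slider's actual code: the class `ComponentFailureTracker`, the `Outcome` enum, the method names and the "strictly greater than the threshold" rule are all hypothetical names and assumptions introduced here.

```java
/**
 * Hypothetical per-component failure tracker (not a Slider class).
 * Only application failures feed the "failed recently" counter.
 */
class ComponentFailureTracker {

    /** Assumed three-way classification of a container's exit. */
    enum Outcome { PREEMPTED, NODE_FAILURE, APPLICATION_FAILURE }

    private final int failureThreshold; // assumed configurable limit
    private int failedRecently;         // reset on a regular schedule
    private int nodeFailures;           // tracked, but not held against the component
    private int preempted;              // tracked, never treated as a failure

    ComponentFailureTracker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Record how a container ended. */
    void record(Outcome outcome) {
        switch (outcome) {
            case PREEMPTED:
                preempted++;            // pre-emption is not a failure
                break;
            case NODE_FAILURE:
                nodeFailures++;         // blamed on the node, not the component
                break;
            case APPLICATION_FAILURE:
                failedRecently++;       // the only case counted as "failed recently"
                break;
            default:
                throw new IllegalArgumentException("Unknown outcome: " + outcome);
        }
    }

    /** Called by a (hypothetical) scheduled task to forget old failures. */
    void resetRecentFailures() {
        failedRecently = 0;
    }

    /** Assumption: the component has failed once the count exceeds the threshold. */
    boolean hasFailedTooOften() {
        return failedRecently > failureThreshold;
    }

    int getFailedRecently() { return failedRecently; }
    int getNodeFailures()   { return nodeFailures; }
    int getPreempted()      { return preempted; }
}
```

With a threshold of 2, any number of `PREEMPTED` or `NODE_FAILURE` events leaves `hasFailedTooOften()` false, a third `APPLICATION_FAILURE` makes it true, and `resetRecentFailures()` clears it again, which is how a long-lived application could tolerate a low rate of component failures.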
-Role History Reloading Enhancements
+##### Role History Reloading Enhancements
How role history is persisted has also been improved
[SLIDER-600](https://issues.apache.org/jira/browse/SLIDER-600)