[
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184378#comment-16184378
]
Gour Saha edited comment on SLIDER-1246 at 9/28/17 6:46 PM:
------------------------------------------------------------
The patch introduces four new resource config properties, which should provide
the health-threshold control required for this feature:
{code}
yarn.container.health.threshold.percent
e.g.
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are
100 containers for a component, then 80 or more running containers will deem
the component as healthy)
{code}
There is no default, so it must be set explicitly in the resources file to
enable the health monitor. It can be defined at the global level to enable
monitors for all components, in which case the same percent applies to all of
them. It can also be defined at the component level to override the global
value for a specific component. Note that if the health monitor is enabled for
a component, the failure threshold is automatically disabled for that
component, so any value set for *_yarn.container.failure.threshold_* is ignored
for that component. If the health threshold is set for one component and the
failure threshold for another, the two compete to declare the app unhealthy,
and whichever trips first brings the app down. App owners need to understand
the behavior of these competing properties and set them appropriately.
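The precedence rules above can be sketched roughly as follows. This is only an
illustration in Python, not the code in the patch: it assumes the resources
file has already been parsed into a dict with "global" and "components"
sections, the helper names and the component names (WORKER, MASTER) are made
up, and any built-in default for the failure threshold is ignored.

```python
HEALTH_KEY = "yarn.container.health.threshold.percent"
FAILURE_KEY = "yarn.container.failure.threshold"


def effective_thresholds(resources, component):
    """Return (health_percent, failure_threshold) for one component.

    Hypothetical helper: component-level values override global ones, and an
    enabled health monitor causes the failure threshold to be ignored.
    """
    merged = dict(resources.get("global", {}))
    merged.update(resources.get("components", {}).get(component, {}))
    health = merged.get(HEALTH_KEY)
    failure = merged.get(FAILURE_KEY)
    if health is not None:
        # Health monitor is enabled, so the failure threshold is disabled.
        return int(health), None
    return None, (int(failure) if failure is not None else None)


def is_healthy(running, desired, health_percent):
    """A component is healthy when running containers meet the percent."""
    return running * 100 >= desired * health_percent


resources = {
    "global": {HEALTH_KEY: "80"},
    "components": {
        "WORKER": {HEALTH_KEY: "60", FAILURE_KEY: "10"},
        "MASTER": {},
    },
}
print(effective_thresholds(resources, "WORKER"))  # component-level override
print(effective_thresholds(resources, "MASTER"))  # falls back to global
print(is_healthy(80, 100, 80), is_healthy(79, 100, 80))
```

With these made-up values, WORKER uses its own 60 percent and its failure
threshold of 10 is ignored, while MASTER inherits the global 80 percent.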
{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1
hour)
{code}
Default is 600 secs (10 mins). This is the amount of time a component is
allowed to stay below the health-threshold percent before the application is
stopped. If the health crosses back above the threshold before the window
expires, the window is reset to 0. So if the health drops below the threshold
again later, it has to stay there for the entire window for the component to be
considered unhealthy.
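To make the reset behavior concrete, here is a rough Python simulation of the
window logic. It is a sketch under assumptions, not the monitor code from the
patch: the function name and the (time, percent) sample format are invented,
and it assumes the app is stopped as soon as the time spent below the threshold
reaches the window.

```python
def should_stop(samples, threshold_percent, window_secs):
    """Decide whether the app would be stopped for a series of health samples.

    samples: (time_secs, healthy_percent) pairs in increasing time order,
    one per poll. Hypothetical format, for illustration only.
    """
    below_since = None  # time at which the current unhealthy stretch began
    for time_secs, percent in samples:
        if percent >= threshold_percent:
            # Health is back at or above the threshold: the window resets.
            below_since = None
            continue
        if below_since is None:
            below_since = time_secs
        if time_secs - below_since >= window_secs:
            return True
    return False


# Recovers at t=500, so the second dip (starting at t=600) gets a fresh window.
recovered = [(0, 70), (300, 70), (500, 85), (600, 70), (1100, 70)]
print(should_stop(recovered, 80, 600))
print(should_stop(recovered + [(1200, 70)], 80, 600))
```

In the first call the component has only been unhealthy for 500 secs since the
reset, so the app keeps running. Adding one more unhealthy sample at t=1200
completes a full 600-sec window and the app would be stopped.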
{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll
frequency to 20 secs)
{code}
The frequency at which the monitor wakes up and checks the component health.
Default is 10 secs. Most applications do not need to set this property; change
it only if you know exactly what you are doing.
{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "400" (sets the init delay
to 400 secs)
{code}
Default is 600 secs. This provides additional lead time before the health
monitor kicks in: the component health check timer starts only at the end of
this init delay. It gives the application extra time to bring up its containers
the first time it is started (while it is working its way up to the
health-threshold percent). If a component is still below the health-threshold
percent when the init delay expires, the monitor kicks in and waits a further
*_yarn.container.health.threshold.window.secs_* before it stops the app
(assuming the component never crosses the threshold percent in that time).
Hence, when the app starts, it effectively gets
*_yarn.container.health.threshold.init.delay.secs_* +
*_yarn.container.health.threshold.window.secs_* to cross the health-threshold
percent. If no extra initial lead time is required, set it to 0.
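As a quick check of the arithmetic, the startup budget can be computed like
this. It is a small Python sketch, not code from the patch: the function name
is made up and the two defaults are the ones stated in this comment.

```python
INIT_DELAY_KEY = "yarn.container.health.threshold.init.delay.secs"
WINDOW_KEY = "yarn.container.health.threshold.window.secs"

# Defaults as described in this comment.
DEFAULT_INIT_DELAY_SECS = 600
DEFAULT_WINDOW_SECS = 600


def startup_budget_secs(props):
    """Time a freshly started app gets to cross the health-threshold percent.

    props: resource properties as a dict of strings (hypothetical helper).
    """
    init_delay = int(props.get(INIT_DELAY_KEY, DEFAULT_INIT_DELAY_SECS))
    window = int(props.get(WINDOW_KEY, DEFAULT_WINDOW_SECS))
    return init_delay + window


print(startup_budget_secs({}))                                    # 1200
print(startup_budget_secs({INIT_DELAY_KEY: "0"}))                 # 600
print(startup_budget_secs({INIT_DELAY_KEY: "400",
                           WINDOW_KEY: "3600"}))                  # 4000
```

So with both defaults the app gets 20 mins after its first start, and setting
the init delay to 0 leaves only the window itself.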
Note, the node failure blacklisting feature is implemented by SLIDER-1199.
was (Author: gsaha):
4 new resources config properties have been introduced in the patch which
should provide the health-threshold control required for this feature -
{code}
yarn.container.health.threshold.percent
e.g.
"yarn.container.health.threshold.percent" : "80" (set to 80%, so if there are
100 containers for a component, then 80 or more running containers will deem
the component as healthy)
{code}
There is no default, so needs to be explicitly set in resources file to enable
health monitor. It can be defined at the global level to enable monitors for
all components. When defined at global-level the same percent is applicable for
all components. It can be defined at component-level also to override the
global value for a specific component. Note, if health monitor is enabled for a
component then failure threshold is automatically disabled for that component.
So, if a value is set for *_yarn.container.failure.threshold_* as well, it will
be ignored for that component. If health-threshold is set for one component and
failure threshold for another, they will compete with each other against
determining an app to be unhealthy. Whoever wins brings the app down first, so
the app owners need to understand the behavior of these competing properties
and set them appropriately.
{code}
yarn.container.health.threshold.window.secs
e.g.
"yarn.container.health.threshold.window.secs" : "3600" (sets the window to 1
hour)
{code}
Default is 600 secs (5 mins). The amount of time a component is allowed to be
below the health-threshold percent after which the application is stopped. If
the health crosses above threshold before the window expires then this window
is reset to 0. So, if the health goes below threshold later again, it has to be
there for the entire window to be considered unhealthy.
{code}
yarn.container.health.threshold.poll.frequency.secs
e.g.
"yarn.container.health.threshold.poll.frequency.secs" : "20" (sets poll
frequency to 20 secs)
{code}
Default is 10 secs. For most purposes this property does not need to be set by
the application owner, unless the app owner knows exactly what she/he is doing.
{code}
yarn.container.health.threshold.init.delay.secs
e.g.
"yarn.container.health.threshold.init.delay.secs" : "1800" (sets the window to
30 mins)
{code}
Default is 600 secs (same as default for
*_yarn.container.health.threshold.window.secs_*). Controls the health monitor's
behavior the exact same way as *_yarn.container.health.threshold.window.secs_*
does, except that it comes into play only the first time when the application
is started while it is working its way up to cross the health-threshold percent
for the first time.
Note, the node failure blacklisting feature is implemented by SLIDER-1199.
> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
> Key: SLIDER-1246
> URL: https://issues.apache.org/jira/browse/SLIDER-1246
> Project: Slider
> Issue Type: Bug
> Components: appmaster, core
> Affects Versions: Slider 0.92
> Reporter: Prasanth Jayachandran
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1246.01.patch
>
>
> In case of a faulty node, multiple container failures will be deemed as an
> application failure.
> Observed this in HIVE-16927, where container failures in certain nodes brings
> down entire application. Slider has to provide a way to not mark application
> as unhealthy if certain threshold of containers are running. Tuning failure
> threshold is not optimal as setting the correct default on large cluster is
> not trivial. Beyond certain failures, slider should mark the node as
> unhealthy and report that back to client/AM. Application could continue to
> run as long as container request is satisfied partially (example: 80%
> containers are running).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)