[
https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16185014#comment-16185014
]
Gour Saha commented on SLIDER-1246:
-----------------------------------
Thanks [~billie.rinaldi], I am incorporating all your comments now.
On this point -
{quote}
as discussed previously offline, i don't think the failure threshold should be
automatically disabled when the health percent is enabled. but since we
disagree on this, i am okay with having the automatic disable until someone
expresses interest in using both features
{quote}
I thought this over and concluded that the failure threshold, which is an
absolute value, will always get in the way of a monitor that is driven by a
percent value. No matter what absolute value we set for the failure threshold,
for a component with a large number of containers it can potentially be less
than the absolute number of containers given by (100 - health.percent)%. In
that scenario the failure threshold always wins, which is as good as not
setting a health threshold in the first place. Also, with flex up and flex
down the health percent always scales accordingly, but the absolute value of
the failure threshold ceases to make sense. It is also very difficult to
document this and provide a use case so that app owners understand how app
health is tracked when both the failure threshold and the health threshold are
in play (for the same component).
Additionally, the current failure threshold logic counts a single container
failing multiple times (while all other n-1 containers are healthy) the same
as multiple containers failing at the same time. This can result in the app
being shut down even though n-1 containers were effectively always running
(unless it is saved by the blacklisting feature of the node failure threshold,
when that is set to a value less than the failure threshold and the containers
were cycling through the same node). The health threshold logic departs
significantly from this: if n-1 containers are healthy and only 1 container
fails multiple times, it is counted only once.
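To make the two points above concrete, here is a minimal Python sketch. It is
not Slider code: the function names, the container ids, and the sample values
(1000 containers, health percent 80, failure threshold 50) are all assumptions
chosen for illustration, and the health check is simplified to count distinct
failed containers rather than track live container state.

```python
from typing import List


def allowed_unhealthy(containers: int, health_percent: int) -> int:
    """Absolute number of containers given by (100 - health_percent)%."""
    return containers * (100 - health_percent) // 100


def failure_threshold_tripped(failure_events: List[str],
                              failure_threshold: int) -> bool:
    """Assumed failure-threshold behaviour: every failure event counts,
    including repeated failures of the same container."""
    return len(failure_events) > failure_threshold


def health_threshold_tripped(failure_events: List[str],
                             containers: int,
                             health_percent: int) -> bool:
    """Assumed health-threshold behaviour: a container that fails many
    times is counted only once."""
    unhealthy = len(set(failure_events))
    return unhealthy > allowed_unhealthy(containers, health_percent)


# Hypothetical component: 1000 containers, health percent 80,
# failure threshold 50 (well below the 200 unhealthy containers allowed).
containers, health_percent, failure_threshold = 1000, 80, 50

# One container failing 51 times while the other 999 stay healthy.
one_flapping = ["container_1"] * 51

print(allowed_unhealthy(containers, health_percent))
print(failure_threshold_tripped(one_flapping, failure_threshold))
print(health_threshold_tripped(one_flapping, containers, health_percent))
```

Under these assumptions the failure threshold trips on the single flapping
container, while the health threshold still sees 999 of 1000 containers as
healthy, so whenever both are enabled the absolute value decides the outcome
first.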
If you still think that there is value in having both in play, then I can
introduce a boolean config which, when set to true, will let both be in play.
Let me know what you think.
> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
> Key: SLIDER-1246
> URL: https://issues.apache.org/jira/browse/SLIDER-1246
> Project: Slider
> Issue Type: Bug
> Components: appmaster, core
> Affects Versions: Slider 0.92
> Reporter: Prasanth Jayachandran
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1246.01.patch, SLIDER-1246.02.patch
>
>
> In case of a faulty node, multiple container failures will be deemed an
> application failure.
> Observed this in HIVE-16927, where container failures on certain nodes bring
> down the entire application. Slider has to provide a way to not mark the
> application as unhealthy if a certain threshold of containers is running.
> Tuning the failure threshold is not optimal, as setting the correct default
> on a large cluster is not trivial. Beyond a certain number of failures,
> Slider should mark the node as unhealthy and report that back to the
> client/AM. The application could continue to run as long as the container
> request is partially satisfied (example: 80% of containers are running).