[ https://issues.apache.org/jira/browse/MESOS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798373#comment-15798373 ]
Alexander Rukletsov commented on MESOS-6833:
--------------------------------------------
If killing by the executor becomes optional, we should give the scheduler
more data about the health status. Here is what I have in mind:
{code}
message TaskStatus {
  <...>
  // Describes whether the task has been determined to be healthy
  // (true) or unhealthy (false) according to the HealthCheck field in
  // the command info.
  //
  // NOTE: This field will be deprecated in favor of a more verbose
  // `health_status` starting from 2.0.
  optional bool healthy = 8;

  // Contains health status for the health check specified in the
  // corresponding `TaskInfo`. If no health check has been specified, this
  // field must be absent; otherwise it must be present even if the health
  // status is not available yet.
  //
  // NOTE: A task status update must be sent if:
  //   1) The health check fails, regardless of the previous value of
  //      `HealthStatusInfo.healthy`.
  //   2) The value or presence of the `HealthStatusInfo.healthy` field
  //      changes.
  optional HealthStatusInfo health_status = 15;
  <...>
}
{code}
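To make the update rule in the NOTE concrete, here is a rough Python sketch. It is an illustration only, not Mesos code: the `HealthStatusInfo` dataclass merely mirrors the proposed message (with `None` standing in for an absent optional field), and `must_send_update` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthStatusInfo:
    # Stand-in for the proposed protobuf message; None models an absent field.
    state: Optional[int] = None
    healthy: Optional[bool] = None
    consecutive_failures: Optional[int] = None


def must_send_update(previous: HealthStatusInfo,
                     current: HealthStatusInfo,
                     check_failed: bool) -> bool:
    """Hypothetical helper: decide whether a status update is required."""
    # Rule 1: a failed check always triggers an update, whatever the
    # previous value of `healthy` was.
    if check_failed:
        return True
    # Rule 2: the value or the presence of `healthy` changed.
    # Comparing Optional[bool] covers both: None != False, None != True.
    return previous.healthy != current.healthy
```

Under this reading, a task that keeps failing produces an update on every failure, while a task that stays healthy produces none until something changes.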
{code}
/**
 * Describes the status of a health check. An empty message means that the
 * status is currently not available, for example, due to the health check
 * being in a grace period.
 */
message HealthStatusInfo {
  // Contains either command exit code, HTTP status code, or TCP handshake
  // return code.
  optional int32 state = 1;

  // Executor must decide locally whether the task is healthy or not based
  // on the health check specification in corresponding `TaskInfo`.
  optional bool healthy = 2;

  // This field tells how many times the health check failed consecutively.
  // It is particularly useful after scheduler failover or disconnect if the
  // task killing decision is delegated to the scheduler.
  optional uint32 consecutive_failures = 3;
}
{code}
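If the kill decision is delegated to the scheduler, the scheduler could apply its own threshold to `consecutive_failures`. A minimal sketch of such a policy, assuming a threshold of 0 means "report failures but never kill" (the behaviour the reporter expected); `should_kill` and `max_failures` are hypothetical names, not Mesos APIs:

```python
def should_kill(consecutive_failures: int, max_failures: int) -> bool:
    """Hypothetical scheduler-side policy, not part of Mesos."""
    # Assumption: a threshold of 0 disables killing entirely, so the
    # scheduler is only notified about failures.
    if max_failures == 0:
        return False
    # Otherwise kill once the reported failure streak reaches the threshold.
    return consecutive_failures >= max_failures
```

Because the count travels in every status update, a scheduler that fails over or reconnects can apply this check without having observed the earlier failures itself.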
> consecutive_failures 0 == 1 in HealthCheck.
> -------------------------------------------
>
> Key: MESOS-6833
> URL: https://issues.apache.org/jira/browse/MESOS-6833
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 0.28.0, 1.0.0, 1.1.0
> Reporter: Lukas Loesche
> Labels: health-check, mesosphere
>
> When defining a HealthCheck with consecutive_failures=0, one would expect
> Mesos to never kill the task and only notify about the failure.
> Instead, Mesos appears to treat consecutive_failures=0 as
> consecutive_failures=1 and kills the task after one failure.
> Since 0 isn't the same as 1, this seems to be a bug and results in
> unexpected behaviour.