[ 
https://issues.apache.org/jira/browse/MESOS-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798373#comment-15798373
 ] 

Alexander Rukletsov commented on MESOS-6833:
--------------------------------------------

If we make task killing by the executor optional, we should give the
scheduler more data about the health status. Here is what I have in mind:
{code}
message TaskStatus {
 <...>

 // Describes whether the task has been determined to be healthy
 // (true) or unhealthy (false) according to the HealthCheck field in
 // the command info.
 //
 // NOTE: This field will be deprecated in favor of a more verbose
 // `health_status` starting from 2.0.
 optional bool healthy = 8;

 // Contains the health status for the health check specified in the
 // corresponding `TaskInfo`. If no health check has been specified, this
 // field must be absent; otherwise it must be present even if the health
 // status is not available yet.
 //
 // NOTE: A task status update must be sent if:
 // 1) The health check fails, regardless of the previous value of
 //    `HealthStatusInfo.healthy`.
 // 2) The value or presence of the `HealthStatusInfo.healthy` field
 //    changes.
 optional HealthStatusInfo health_status = 15;

 <...>
}
{code}
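
One way to read the two update rules in the NOTE above, as a minimal Python sketch. The function name `must_send_update` and the use of `None` for an absent `healthy` field are assumptions made for illustration and are not part of Mesos; the sketch also assumes that a failed health check is reported as `healthy == False`.

```python
from typing import Optional


def must_send_update(previous: Optional[bool], current: Optional[bool]) -> bool:
    """Return True if a task status update must be sent.

    `previous` and `current` model `HealthStatusInfo.healthy`;
    None stands for an absent field (a hypothetical modeling choice).
    """
    # Rule 1: a failed health check always triggers an update,
    # regardless of the previous value.
    if current is False:
        return True
    # Rule 2: any change in value or presence triggers an update.
    return previous != current
```

Under these assumptions, repeated failures keep producing updates, while repeated successes do not.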
{code}
/**
 * Describes the status of a health check. An empty message means that the
 * status is currently not available, for example, due to the health check
 * being in a grace period.
 */
message HealthStatusInfo {
 // Contains the command exit code, HTTP status code, or TCP handshake
 // return code, depending on the type of health check.
 optional int32 state = 1;

 // The executor must decide locally whether the task is healthy, based
 // on the health check specification in the corresponding `TaskInfo`.
 optional bool healthy = 2;
 
 // The number of consecutive health check failures. This is particularly
 // useful after a scheduler failover or disconnect if the task-killing
 // decision is delegated to the scheduler.
 optional uint32 consecutive_failures = 3;
}
{code}
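
To illustrate how a scheduler could use these fields if the kill decision is delegated to it, here is a minimal Python sketch. The `HealthStatus` dataclass only mirrors the proposed message, and `should_kill` and `max_failures` are hypothetical names. Treating a threshold of 0 as "notify only, never kill" is the behaviour the issue reporter expects, not confirmed Mesos behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthStatus:
    """Mirrors the proposed HealthStatusInfo; None means the field is absent."""
    state: Optional[int] = None
    healthy: Optional[bool] = None
    consecutive_failures: Optional[int] = None


def should_kill(status: HealthStatus, max_failures: int) -> bool:
    """Scheduler-side kill decision (hypothetical policy)."""
    # Assumption: a threshold of 0 means "notify only, never kill".
    if max_failures == 0:
        return False
    # An empty message (e.g. grace period) or a healthy task is left alone.
    if status.healthy is not False:
        return False
    failures = status.consecutive_failures or 0
    return failures >= max_failures
```

Because `consecutive_failures` travels with every status update, a scheduler that has just failed over could apply such a policy without having seen the earlier failures itself.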

> consecutive_failures 0 == 1 in HealthCheck.
> -------------------------------------------
>
>                 Key: MESOS-6833
>                 URL: https://issues.apache.org/jira/browse/MESOS-6833
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 0.28.0, 1.0.0, 1.1.0
>            Reporter: Lukas Loesche
>              Labels: health-check, mesosphere
>
> When defining a HealthCheck with consecutive_failures=0, one would expect
> Mesos to never kill the task and only notify about the failure.
> Instead, Mesos seems to treat consecutive_failures=0 as
> consecutive_failures=1 and kills the task after 1 failure.
> Since 0 isn't the same as 1, this seems to be a bug and results in unexpected
> behaviour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
