[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

Benjamin Mahler (JIRA) Thu, 10 Dec 2015 10:02:53 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051371#comment-15051371
 ]


Benjamin Mahler commented on MESOS-4106:
----------------------------------------

I'm not sure we should say sleeping provides a "very weak guarantee"; with a 
sleep there is indeed *no guarantee* that the message is sent.

The approach you've suggested, querying with a timeout, still provides no 
form of guarantee unless you are going to wait indefinitely, or use the timeout 
to trigger a retry rather than an exit (what did you intend to happen after the 
timeout?). That approach amounts to guaranteeing application-level delivery, 
and we generally just use an "acknowledgement" message with retries to do this, 
rather than a separate query.
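
To make the "acknowledgement with retries" idea concrete, here is a minimal 
sketch using only the C++ standard library. It is not libprocess or Mesos code: 
{{sendWithRetries}}, its parameters, and the {{sendAndCheckAck}} callback are 
hypothetical names made up for illustration. The callback stands in for "send 
the message and report whether an acknowledgement came back".

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not a libprocess API. Calls `sendAndCheckAck` until it
// reports that an acknowledgement arrived or `maxAttempts` sends were made.
// Returns true only if the receiver acknowledged the message.
bool sendWithRetries(
    const std::function<bool()>& sendAndCheckAck,
    int maxAttempts,
    std::chrono::milliseconds backoff)
{
  assert(maxAttempts > 0);

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (sendAndCheckAck()) {
      return true;  // The receiver confirmed it got the message.
    }

    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(backoff);  // Wait, then resend.
    }
  }

  return false;  // Never acknowledged; the caller decides what to do next.
}
```

Note that the timeout here triggers a resend rather than an exit, and that a 
bounded {{maxAttempts}} still only gives a bounded guarantee; a real guarantee 
requires retrying until the acknowledgement arrives.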

However, since the executor resides on the same machine, and executor failover 
is not supported, we're unlikely to bother implementing acknowledgements with 
retries here. We only need to wait for the data to be sent on the socket. This 
gives a "weak guarantee": if there are no socket errors (note that both ends of 
the socket are within the same machine) and the executor remains up, the 
message will eventually be processed by the executor. MESOS-4111 discusses the 
general issue of being able to exit after ensuring that messages are processed 
in libprocess.
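
As a rough illustration of "wait for the data to be sent before exiting", the 
sketch below models an asynchronous sender with a background thread and adds a 
{{flush()}} that blocks until every queued message has been handed to the 
write callback. This is a standard-library model only, not how libprocess is 
implemented; {{FlushingSender}}, {{send}}, {{flush}} and the {{write}} callback 
(a stand-in for the socket write) are all hypothetical names.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical model of an asynchronous sender; not a libprocess class.
class FlushingSender {
 public:
  // `write` stands in for writing the message to the socket.
  explicit FlushingSender(std::function<void(const std::string&)> write)
    : write_(std::move(write)), worker_([this]() { run(); }) {}

  ~FlushingSender() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
    worker_.join();  // The worker drains the queue before returning.
  }

  // Returns immediately; the message is written later by the worker.
  void send(std::string message) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(message));
      ++pending_;
    }
    cv_.notify_all();
  }

  // Blocks until every message passed to send() so far has been written.
  void flush() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this]() { return pending_ == 0; });
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      cv_.wait(lock, [this]() { return done_ || !queue_.empty(); });
      if (queue_.empty()) {
        return;  // Shutdown was requested and nothing is left to write.
      }

      std::string message = std::move(queue_.front());
      queue_.pop_front();

      lock.unlock();
      write_(message);  // Perform the write without holding the lock.
      lock.lock();

      --pending_;
      cv_.notify_all();  // Wake any caller blocked in flush().
    }
  }

  std::function<void(const std::string&)> write_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::string> queue_;
  int pending_ = 0;
  bool done_ = false;
  std::thread worker_;  // Declared last so it starts after the other members.
};
```

Calling {{send()}} and then exiting the process right away can drop the 
message, which is the failure mode in this ticket; calling {{flush()}} before 
exiting closes that window in the model. As above, this only establishes that 
the data was written, not that the receiver processed it.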

In the case of the long-standing command executor sleep, we needed to handle 
agent failure. So we are already using acknowledgements there, and can use them 
to {{stop()}} cleanly.

> The health checker may fail to inform the executor to kill an unhealthy task 
> after max_consecutive_failures.
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-4106
>                 URL: https://issues.apache.org/jira/browse/MESOS-4106
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 0.27.0
>
>
> This was reported by [~tan] experimenting with health checks. Many tasks were 
> launched with the following health check, taken from the container 
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check 
> --executor=(1)@127.0.0.1:39629 
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
>  --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to 
> {{\-\-consecutive_failures}} being set; however, only some tasks get killed, 
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. 
> Unfortunately, this may lead to a message drop since libprocess may not have 
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has 
> been around since the svn days. See 
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].



