[
https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051371#comment-15051371
]
Benjamin Mahler commented on MESOS-4106:
----------------------------------------
I'm not sure we should say that sleeping provides a "very weak guarantee";
with a sleep there is indeed *no guarantee* that the message is sent.
The approach you've suggested with querying with a timeout still provides no
form of guarantee, unless you are going to wait indefinitely or use the timeout
mentioned to trigger a retry rather than an exit (what did you intend to happen
after the timeout?). This approach is guaranteeing application-level delivery,
and we generally just use an "acknowledgement" message with retries to do this,
rather than a separate query.
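To make the acknowledgement-with-retries idea concrete, here is a minimal sketch. It is not libprocess or Mesos code: {{SendAndAwaitAck}} and {{sendWithRetries}} are hypothetical names, and the transport is assumed to be a callback that returns true only when an acknowledgement arrived within some per-attempt timeout.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical transport (an assumption, not a libprocess API): sends
// `message` and returns true only if an acknowledgement came back
// before the per-attempt timeout expired.
using SendAndAwaitAck = std::function<bool(const std::string& message)>;

// Re-sends `message` until it is acknowledged or `maxAttempts` is used up.
// Returns the number of attempts that were needed, or 0 if no
// acknowledgement ever arrived.
int sendWithRetries(
    const SendAndAwaitAck& sendAndAwaitAck,
    const std::string& message,
    int maxAttempts)
{
  assert(maxAttempts > 0);

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    if (sendAndAwaitAck(message)) {
      return attempt; // Acknowledged: now it is safe to exit.
    }
    // No acknowledgement within the timeout: retry rather than exit.
  }

  return 0;
}
```

The point of the sketch is that a timeout only helps if it triggers another attempt; with a bounded {{maxAttempts}} the caller still has to decide what to do when 0 is returned.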
However, since the executor resides on the same machine, and executor failover
is not supported, we're unlikely to bother implementing acknowledgements with
retries here. We only need to wait for the data to be sent on the socket. This
gives a "weak guarantee": if there are no socket errors (note that both ends of
the socket are within the same machine) and the executor remains up, the
message will eventually be processed by the executor. MESOS-4111 discusses
the general issue of being able to exit after ensuring that messages are
processed in libprocess.
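As a rough illustration of "wait for the data to be sent before exiting", the sketch below models an asynchronous send with a background thread and a future. It does not use libprocess: {{asyncSend}}, {{sendAndWait}} and the {{wire}} vector standing in for the socket are all assumptions made for the example.

```cpp
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hands `message` to a background thread, which "writes" it to `wire`
// (a stand-in for the socket). The returned future becomes ready only
// after the write has happened. If the caller exited right after this
// call returned, the write might never take place.
std::future<void> asyncSend(std::vector<std::string>& wire, std::string message)
{
  std::promise<void> sent;
  std::future<void> future = sent.get_future();

  std::thread(
      [&wire, msg = std::move(message), p = std::move(sent)]() mutable {
        wire.push_back(std::move(msg)); // The actual "socket write".
        p.set_value();                  // Signal that the data was sent.
      }).detach();

  return future;
}

// Sends `message` and blocks until it has been written to `wire`.
// Only after this returns is it safe for the sender to exit.
void sendAndWait(std::vector<std::string>& wire, std::string message)
{
  asyncSend(wire, std::move(message)).wait();
}
```

This only gives the weak guarantee described above: the data has left the sender, but nothing confirms that the receiver processed it.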
In the case of the long-standing command executor sleep, we needed to handle
agent failure. So we are already using acknowledgements there, and can use them
to {{stop()}} cleanly.
> The health checker may fail to inform the executor to kill an unhealthy task
> after max_consecutive_failures.
> ------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-4106
> URL: https://issues.apache.org/jira/browse/MESOS-4106
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0,
> 0.23.1, 0.24.0, 0.24.1, 0.25.0
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Blocker
> Fix For: 0.27.0
>
>
> This was reported by [~tan] experimenting with health checks. Many tasks were
> launched with the following health check, taken from the container
> stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check
> --executor=(1)@127.0.0.1:39629
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
> --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to
> {{\-\-consecutive_failures}} being set. However, only some tasks get killed,
> while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits.
> Unfortunately, this may lead to a message drop since libprocess may not have
> sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has
> been around since the svn days. See
> [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].