Benjamin Mahler created MESOS-4106:
--------------------------------------
Summary: The health checker may fail to inform the executor to
kill an unhealthy task after max_consecutive_failures.
Key: MESOS-4106
URL: https://issues.apache.org/jira/browse/MESOS-4106
Project: Mesos
Issue Type: Bug
Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1,
0.21.2, 0.21.1, 0.20.1, 0.20.0
Reporter: Benjamin Mahler
Priority: Blocker
This was reported by [~tan] while experimenting with health checks. Many tasks
were launched with the following health check (taken from the container's
stdout/stderr):
{code}
Launching health check process: /usr/local/libexec/mesos/mesos-health-check
--executor=(1)@127.0.0.1:39629
--health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
--task_id=sleepy-2
{code}
This should have led to all tasks getting killed, since
{{\-\-consecutive_failures}} is set. However, only some tasks get killed,
while others remain running.
It turns out that the health check binary does a {{send}} and promptly exits.
Unfortunately, this may lead to a message drop since libprocess may not have
sent this message over the socket by the time the process exits.
We work around this in the command executor with a manual sleep, which has been
around since the svn days. See
[here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
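The race can be illustrated with a minimal sketch. This is not libprocess code: {{AsyncSender}}, {{flush}}, {{delivered}} and {{report_unhealthy}} are hypothetical names, and the 20 ms write latency is an arbitrary assumption. The sketch only models a {{send}} that returns before the message has been written, and contrasts exiting immediately with waiting for delivery first.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for an asynchronous messaging layer. This is NOT the
// libprocess API; it only models "send() returns before the bytes are written".
class AsyncSender {
 public:
  AsyncSender() : worker_([this] { run(); }) {}

  // Drains queued messages, then stops the worker. A process that exits right
  // after send() gets no such drain, which is the bug being illustrated.
  ~AsyncSender() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    wake_.notify_all();
    worker_.join();
  }

  // Enqueues the message and returns immediately (fire and forget).
  void send(std::string message) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      assert(!stopping_);
      queue_.push(std::move(message));
    }
    wake_.notify_all();
  }

  // Blocks until every queued message has been "written".
  void flush() {
    std::unique_lock<std::mutex> lock(mutex_);
    drained_.wait(lock, [this] { return queue_.empty() && !writing_; });
  }

  // Number of messages fully written so far.
  std::size_t delivered() {
    std::lock_guard<std::mutex> lock(mutex_);
    return delivered_;
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      wake_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
      if (queue_.empty()) return;  // only reachable once stopping_ is set
      queue_.pop();
      writing_ = true;
      lock.unlock();
      // Simulated socket write latency (assumed value).
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      lock.lock();
      writing_ = false;
      ++delivered_;
      drained_.notify_all();
    }
  }

  std::mutex mutex_;
  std::condition_variable wake_;
  std::condition_variable drained_;
  std::queue<std::string> queue_;
  bool stopping_ = false;
  bool writing_ = false;
  std::size_t delivered_ = 0;
  std::thread worker_;  // declared last so the members above exist first
};

// Sends one "unhealthy" report and returns how many messages had been written
// at the point where the caller would be free to exit.
std::size_t report_unhealthy(bool wait_for_delivery) {
  AsyncSender sender;
  sender.send("unhealthy: sleepy-2");
  if (wait_for_delivery) sender.flush();
  return sender.delivered();
}
```

With {{wait_for_delivery}} set, the count is always 1 when the function returns. Without it, the count read right after {{send}} will typically still be 0, which is the window in which an exiting process loses the message. A fixed sleep narrows that window but relies on the write finishing within the sleep, whereas waiting for an explicit delivery signal does not depend on timing.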
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)