[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050358#comment-15050358 ]
Benjamin Bannier edited comment on MESOS-4106 at 12/10/15 8:50 AM:
-------------------------------------------------------------------

Late to the party as this already went in.

Just sleeping here to have the message out is a very weak guarantee (it does not guarantee that the message was actually sent). What one should probably do instead to make this robust is block until a state change in {{executor}} happens (with a timeout), e.g., observe change of state of {{taskID}} via querying the {{executor}}.

was (Author: bbannier):
Late to the party as this already went in.

Just {{sleep}}ing here to have the message out is a very weak guarantee (it does not guarantee that the message was actually sent). What one should probably do instead to make this robust is block until a state change in {{executor}} happens (with a timeout), e.g., observe change of state of {{taskID}} via querying the {{executor}}.

> The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-4106
>                 URL: https://issues.apache.org/jira/browse/MESOS-4106
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 0.27.0
>
>
> This was reported by [~tan] experimenting with health checks.
> Many tasks were launched with the following health check, taken from the container stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check
> --executor=(1)@127.0.0.1:39629
> --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0}
> --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to {{\-\-consecutive_failures}} being set; however, only some tasks get killed, while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has been around since the svn days. See [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
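
For illustration only, the alternative proposed in the comment (block until a state change is observed, bounded by a timeout, instead of a fixed sleep) might be sketched as below. This is a minimal sketch under stated assumptions: {{waitUntil}} and its parameters are hypothetical names and are not part of Mesos or libprocess, and the {{query}} callback stands in for whatever mechanism would actually ask the executor about the task's state.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not a Mesos or libprocess API).
// Repeatedly calls `query` until it returns true or `timeout` elapses.
// Returns true if the condition was observed, false if the deadline passed.
inline bool waitUntil(
    const std::function<bool()>& query,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(10))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (true) {
    // Check first so that an already-satisfied condition returns at once.
    if (query()) {
      return true;
    }

    // Give up once the deadline has been reached.
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }

    // Back off briefly before asking again.
    std::this_thread::sleep_for(interval);
  }
}
```

Unlike a bare sleep, the caller learns whether the expected state change was actually seen: a {{false}} return means the timeout expired, so the sender can retry or report an error rather than exit silently. A steady clock is used for the deadline so that wall-clock adjustments do not shorten or extend the wait.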