[
https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114988#comment-16114988
]
Benjamin Mahler commented on MESOS-7783:
----------------------------------------
The bug occurs as follows:
(1) Two (or more) tasks arrive at the agent, but do not yet reach
{{Slave::_run}}.
(2) Kill task messages arrive at the agent and are processed.
(3) The first task to reach {{Slave::_run}} will cause the framework to be
removed, since the pending tasks / executors are now empty (see
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1841-L1845]).
(4) The remaining tasks to reach {{Slave::_run}} encounter the framework as
removed and are dropped without a status update (see
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1788-L1794]).
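
To make the ordering concrete, here is a minimal toy model of the four steps. It is only a sketch, not the agent's code: {{ToyAgent}}, its members and {{reproduce}} are hypothetical names, per-framework state is reduced to a set of pending task ids, and it assumes the first task to reach the run step is the one that gets a TASK_KILLED update (the comment above only says that the later tasks get none).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for the agent's bookkeeping. NOT the real Mesos data model.
struct ToyAgent {
  // Framework id -> tasks that have arrived but not yet reached run().
  std::map<std::string, std::set<std::string>> pending;
  // Status updates "sent" to the framework, as "<task>:<state>" strings.
  std::vector<std::string> updates;

  // Step (1): a task arrives and is recorded as pending.
  void launch(const std::string& framework, const std::string& task) {
    pending[framework].insert(task);
  }

  // Step (2): a kill is processed before the task reached run().
  // The task leaves the pending set; the framework entry stays for now.
  void kill(const std::string& framework, const std::string& task) {
    auto it = pending.find(framework);
    if (it != pending.end()) {
      it->second.erase(task);
    }
  }

  // Steps (3) and (4): the deferred continuation of a launch.
  // Returns true if a status update was sent for the task.
  bool run(const std::string& framework, const std::string& task) {
    auto it = pending.find(framework);
    if (it == pending.end()) {
      // Step (4): the framework is already gone, so the task is dropped
      // without any status update.
      return false;
    }
    if (it->second.erase(task) == 0) {
      // The task was killed while pending (assumed to be acknowledged here).
      updates.push_back(task + ":TASK_KILLED");
      if (it->second.empty()) {
        // Step (3): nothing pending any more, so the framework is removed.
        pending.erase(it);
      }
      return true;
    }
    updates.push_back(task + ":TASK_RUNNING");
    return true;
  }
};

// Replays the scenario for two tasks and returns how many got no update.
int reproduce() {
  ToyAgent agent;
  agent.launch("f", "t1");
  agent.launch("f", "t2");
  agent.kill("f", "t1");
  agent.kill("f", "t2");

  int dropped = 0;
  for (const char* task : {"t1", "t2"}) {
    if (!agent.run("f", task)) {
      ++dropped;
    }
  }
  // The framework entry was removed by the first run().
  assert(agent.pending.empty());
  return dropped;
}
```

In this model {{reproduce()}} reports one dropped task: the second {{run}} finds no framework entry and returns without recording an update, which is the situation a framework that retries kills until it sees an update cannot recover from.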
> Framework might not receive status update when a just launched task is killed
> immediately
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-7783
> URL: https://issues.apache.org/jira/browse/MESOS-7783
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.2.0
> Reporter: Benjamin Bannier
> Priority: Critical
> Labels: reliability
> Attachments: GroupDeployIntegrationTest.log.zip, logs
>
>
> Our Marathon team is seeing failures in their integration test suite where
> Marathon gets stuck in an infinite loop trying to kill a just-launched task.
> In their test, a task is launched and then immediately killed; the framework
> does not, e.g., wait for any task status update in between.
> In this case the launch and kill messages arrive at the agent in the correct
> order, but neither the launch path nor the kill path in the agent reaches the
> point where a status update is sent to the framework. Since the framework has
> seen no status update for the task, it re-triggers the kill, causing an
> infinite loop.