[ https://issues.apache.org/jira/browse/MESOS-7783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114988#comment-16114988 ]
Benjamin Mahler commented on MESOS-7783:
----------------------------------------

The bug occurs as follows:

(1) Two (or more) tasks arrive at the agent, but do not yet reach {{Slave::_run}}.
(2) Kill task messages arrive at the agent and are processed.
(3) The first task to reach {{Slave::_run}} will cause the framework to be removed, since the pending tasks / executors are now empty (see [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1841-L1845]).
(4) The remaining tasks to reach {{Slave::_run}} encounter the framework as removed and are dropped without a status update (see [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp?utf8=%E2%9C%93#L1788-L1794]).

> Framework might not receive status update when a just launched task is killed immediately
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-7783
>                 URL: https://issues.apache.org/jira/browse/MESOS-7783
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.2.0
>            Reporter: Benjamin Bannier
>            Priority: Critical
>              Labels: reliability
>         Attachments: GroupDeployIntegrationTest.log.zip, logs
>
>
> Our Marathon team are seeing issues in their integration test suite where
> Marathon gets stuck in an infinite loop trying to kill a just launched task.
> In their test, a task launch is immediately followed by a kill of the same
> task; the framework does not, e.g., wait for any task status update first.
> In this case the launch and kill messages arrive at the agent in the correct
> order, but neither the launch path nor the kill path in the agent reaches the
> point where a status update is sent to the framework. Since the framework has
> seen no status update on the task, it re-triggers a kill, causing an infinite
> loop.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
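
To make the ordering in steps (1) to (4) of the comment easier to follow, here is a minimal toy model of the race in C++. It is only a sketch and is not the Mesos agent code: {{ToyAgent}}, {{receive}}, {{kill}}, {{run}} and {{simulateRace}} are hypothetical names, and {{run}} merely stands in for {{Slave::_run}}. The model also assumes that the first task to reach {{run}} gets a TASK_KILLED update before the framework is removed; the comment does not state this, so treat it as an assumption.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model of the race. All names are hypothetical; not Mesos code.
struct ToyAgent {
  // frameworkId -> tasks that have arrived but not yet reached run().
  std::map<std::string, std::set<std::string>> pending;
  std::vector<std::string> updates;  // status updates sent to the framework
  std::vector<std::string> dropped;  // tasks dropped without any update

  // Step (1): a task arrives and is recorded as pending.
  void receive(const std::string& framework, const std::string& task) {
    pending[framework].insert(task);
  }

  // Step (2): a kill for a pending task only removes it from the set.
  void kill(const std::string& framework, const std::string& task) {
    auto it = pending.find(framework);
    if (it != pending.end()) {
      it->second.erase(task);
    }
  }

  // Steps (3) and (4): the deferred continuation of the launch.
  void run(const std::string& framework, const std::string& task) {
    auto it = pending.find(framework);
    if (it == pending.end()) {
      // Step (4): framework already removed, task is dropped silently.
      dropped.push_back(task);
      return;
    }
    if (it->second.count(task) == 0) {
      // Task was killed while pending (update here is an assumption).
      updates.push_back(task + ":TASK_KILLED");
      if (it->second.empty()) {
        pending.erase(it);  // Step (3): nothing pending, remove framework.
      }
      return;
    }
    it->second.erase(task);
    updates.push_back(task + ":TASK_RUNNING");
  }
};

// Replays the sequence from the comment with two tasks, t1 and t2.
ToyAgent simulateRace() {
  ToyAgent agent;
  agent.receive("f", "t1");
  agent.receive("f", "t2");
  agent.kill("f", "t1");
  agent.kill("f", "t2");
  agent.run("f", "t1");  // removes the framework
  agent.run("f", "t2");  // finds no framework, no update is sent
  return agent;
}
```

In this model, {{simulateRace()}} leaves {{t2}} in {{dropped}} with no entry in {{updates}}, which mirrors the missing status update that keeps the framework retrying the kill.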