[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319920#comment-16319920
 ] 

Gilbert Song commented on MESOS-8391:
-------------------------------------

The root cause has been found. This bug was introduced by this patch:
https://reviews.apache.org/r/63887/

Basically, Marathon does not have the correct task update because the master 
does not send it, and the master itself does not have the correct task status. 
This is because the agent v1 API {{WAIT_NESTED_CONTAINER}} is called by the 
default executor and the agent forwards it to the composing containerizer, but 
{{ComposingContainerizer::wait()}} skips it:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/composing.cpp#L585~#L587

This bug in the composing containerizer is only reproducible if a task is 
killed after the agent restarts. When a task is killed, the hashmap 
{{containers_}} in the composing containerizer is not maintained correctly and 
the termination future is returned instead. Before the patch r/63887 there was 
no such problem, because the composing containerizer called the underlying 
containerizer's {{wait()}} directly.
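
To make the difference concrete, here is a minimal C++17 sketch of the two 
lookup strategies. It is a toy model, not the Mesos code: {{ToyContainerizer}}, 
{{ToyComposing}}, {{waitViaIndex()}} and {{waitDirect()}} are hypothetical 
names, the synchronous {{std::optional<int>}} result stands in for the real 
future-based API, and it assumes the skip happens because the container ID is 
missing from {{containers_}}.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy stand-in for an underlying containerizer (hypothetical, not the Mesos API).
struct ToyContainerizer {
  // Container ID -> exit status, recorded once the container has terminated.
  std::map<std::string, int> exitStatus;

  // Returns the recorded exit status, or an empty optional if none is known.
  std::optional<int> wait(const std::string& id) const {
    auto it = exitStatus.find(id);
    if (it == exitStatus.end()) {
      return std::nullopt;
    }
    return it->second;
  }
};

// Toy composing layer that keeps its own index of known containers.
struct ToyComposing {
  ToyContainerizer* underlying;
  std::map<std::string, ToyContainerizer*> containers_;

  // Assumed post-patch behaviour: consult the index first and give up
  // early when the container ID is not in it.
  std::optional<int> waitViaIndex(const std::string& id) const {
    auto it = containers_.find(id);
    if (it == containers_.end()) {
      return std::nullopt;  // the wait is skipped; the caller learns nothing
    }
    return it->second->wait(id);
  }

  // Pre-patch behaviour as described above: delegate directly, so the
  // state of the index does not matter.
  std::optional<int> waitDirect(const std::string& id) const {
    assert(underlying != nullptr);
    return underlying->wait(id);
  }
};
```

In this sketch, if {{containers_}} is empty (as assumed here after an agent 
restart) while the underlying containerizer has recorded the exit status of a 
killed container, {{waitViaIndex()}} returns an empty result and 
{{waitDirect()}} returns the recorded status.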

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> -----------------------------------------------------------------------------------
>
>                 Key: MESOS-8391
>                 URL: https://issues.apache.org/jira/browse/MESOS-8391
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, executor
>    Affects Versions: 1.5.0
>            Reporter: Ivan Chernetsky
>            Assignee: Gilbert Song
>            Priority: Blocker
>         Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 10000}}.
> # Restart the Mesos agent on the node where the pod was launched.
> # Kill one of the pod tasks.
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> {{TASK_KILLING}} state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
