[
https://issues.apache.org/jira/browse/MESOS-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217726#comment-15217726
]
Peter Kolloch commented on MESOS-4999:
--------------------------------------
"Deleting" tasks in Marathon really means that Marathon submits kills for these
tasks to Mesos. It will not update or delete the tasks immediately but it will
wait for a notification from Mesos.
Superficially, this looks like a Mesos agent died. In that case, it often takes
long (i.e. up to 10min or more, depending on your config) until Mesos responds
to a kill with a "TASK_KILLED" or "TASK_LOST". Therefore, it looks as if
Marathon does not respond to the kill.
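For illustration only (not from the original report; the endpoint and app id are assumptions): a minimal Python sketch that polls the Marathon task list for an app while a kill is pending. Until Mesos reports "TASK_KILLED" or "TASK_LOST" back, the tasks typically remain in this list, which is why the kill can appear to be ignored.
{code}
import time
import requests

MARATHON = "http://localhost:8080"  # assumed Marathon endpoint
APP_ID = "/kill-test"               # hypothetical application id

# Poll the app's task list. While the kill is still pending on the Mesos
# side, the tasks usually keep showing up here; an entry with "stagedAt"
# set but no "startedAt" corresponds to a task stuck in staging.
for _ in range(60):
    resp = requests.get(MARATHON + "/v2/apps" + APP_ID + "/tasks")
    resp.raise_for_status()
    tasks = resp.json().get("tasks", [])
    if not tasks:
        print("No tasks left - Marathon received the kill confirmations.")
        break
    for task in tasks:
        print(task["id"], "stagedAt:", task.get("stagedAt"),
              "startedAt:", task.get("startedAt"))
    time.sleep(10)
{code}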
Ideally, we would expose another task state such as "task kill sent" in the
Marathon API so that the user can see what is going on, but this is not
implemented yet. Sorry for the confusion.
[I cannot verify this hypothesis easily without the Marathon logs]
> Mesos (or Marathon) lost tasks
> ------------------------------
>
> Key: MESOS-4999
> URL: https://issues.apache.org/jira/browse/MESOS-4999
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.27.2
> Environment: mesos - 0.27.0
> marathon - 0.15.2
> 189 mesos slaves with Ubuntu 14.04.2 on HP ProLiant DL380 Gen9,
> CPU - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @2.50GHz (48 cores (with
> hyperthreading))
> RAM - 264G,
> Storage - 3.0T on RAID on HP Smart Array P840 Controller,
> HDD - 12 x HP EH0600JDYTL
> Network - 2 x Intel Corporation Ethernet 10G 2P X710,
> Reporter: Sergey Galkin
> Attachments: agent-mesos-docker-logs.tar.xz,
> masternode-1-mesos-marathon-log.tar.xz,
> masternode-3-mesos-marathon-log.tar.xz, mesos-nodes.png
>
>
> After creating and deleting a lot of applications with Docker instances
> through the Marathon API, I have a lot of lost tasks after the last
> *deletion of all applications in Marathon*.
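> For illustration only (this sketch is not part of the original report; the
> endpoint and the app definition are assumptions): a minimal Python example of
> the create/delete cycle described above, run against the Marathon REST API.
> {code}
> import time
> import requests
>
> MARATHON = "http://localhost:8080"  # assumed Marathon endpoint
>
> # Hypothetical minimal Docker-based app definition
> app = {
>     "id": "/load-test-app",
>     "cpus": 0.1,
>     "mem": 64,
>     "instances": 10,
>     "container": {
>         "type": "DOCKER",
>         "docker": {"image": "nginx", "network": "BRIDGE"},
>     },
> }
>
> # Create the app, let it run briefly, then delete it again. The DELETE
> # makes Marathon send kill requests for all of the app's tasks to Mesos.
> requests.post(MARATHON + "/v2/apps", json=app).raise_for_status()
> time.sleep(60)
> requests.delete(MARATHON + "/v2/apps" + app["id"]).raise_for_status()
> {code}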
> They fall into three types:
> 1. Tasks hanging in STAGED status. I don't see these tasks in 'docker ps' on
> the slave, and _service docker restart_ on the Mesos slave did not fix them.
> 2. Tasks stuck in RUNNING because Docker hangs and can't delete the instances (a lot of
> {code}
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> {code}
> in stdout). _docker stop ID_ hangs; these tasks can be fixed by _service
> docker restart_ on the Mesos slave.
> 3. Tasks still shown as RUNNING after _service docker restart_ on the Mesos slave.
> Screenshot attached