[jira] [Commented] (MESOS-4999) Mesos (or Marathon) lost tasks

Sergey Galkin (JIRA) Wed, 23 Mar 2016 08:09:03 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208554#comment-15208554
 ]


Sergey Galkin commented on MESOS-4999:
--------------------------------------

We have also found that on several Mesos clusters there was a hidden Docker container.
For example, Mesos created 2 Docker containers:
b3e5942f08ab  -  172.17.0.3
65120c7a4097 - 172.17.0.2

But when we pinged 172.17.0.3, the ping succeeded, yet we did not see the 
ICMP packets inside b3e5942f08ab. At the same time, in 65120c7a4097 we did 
see ICMP packets with tcpdump. 
After _service docker stop_ we still saw an interface that should have disappeared:
veth4ae983a Link encap:Ethernet  HWaddr 4e:58:fc:62:e9:0b  
          inet6 addr: fe80::4c58:fcff:fe62:e90b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:217964 errors:0 dropped:0 overruns:0 frame:0
          TX packets:618230 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:15541840 (15.5 MB)  TX bytes:64680192 (64.6 MB)

and we could still ping 172.17.0.3 (the docker process was stopped at this time).

We found that an nginx process was still running in memory at 100% CPU. It looks 
like a hidden, hung Docker container. We can't kill it with _kill -9_; only 
restarting the node helps.
We have seen this behavior on 10-12 Mesos slaves. 
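
To check other slaves for the same symptom, the interface listing captured after _service docker stop_ can be scanned for veth devices that are still present. The following is only a minimal sketch: the function name `veth_interfaces` and the embedded sample text are assumptions made for illustration, and it only parses text that has already been collected (for example, saved ifconfig output); it does not run any command itself.

```python
import re


def veth_interfaces(ifconfig_output):
    """Return the names of veth interfaces found in ifconfig-style text.

    Hypothetical helper for illustration: it assumes each interface block
    starts at column 0 with the interface name, as in the output above.
    """
    names = []
    for line in ifconfig_output.splitlines():
        # Indented lines are details of the current interface, so only
        # lines that begin with "veth" at column 0 are matched.
        match = re.match(r"veth\w+", line)
        if match:
            names.append(match.group(0))
    return names


# Shortened copy of the listing from this comment, used as sample input.
SAMPLE = """\
veth4ae983a Link encap:Ethernet  HWaddr 4e:58:fc:62:e9:0b
          inet6 addr: fe80::4c58:fcff:fe62:e90b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
"""

if __name__ == "__main__":
    for name in veth_interfaces(SAMPLE):
        print("veth interface still present:", name)
```

If the list is non-empty while the docker service is stopped, the slave probably has the same leftover interface as described above.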


> Mesos (or Marathon) lost tasks
> ------------------------------
>
>                 Key: MESOS-4999
>                 URL: https://issues.apache.org/jira/browse/MESOS-4999
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.2
>         Environment: mesos - 0.27.0
> marathon - 0.15.2
> 189 mesos slaves with Ubuntu 14.04.2 on HP ProLiant DL380 Gen9,
> CPU - 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @2.50GHz (48 cores (with 
> hyperthreading))
> RAM - 264G,
> Storage - 3.0T on RAID on HP Smart Array P840 Controller,
> HDD - 12 x HP EH0600JDYTL
> Network - 2 x Intel Corporation Ethernet 10G 2P X710,
>            Reporter: Sergey Galkin
>         Attachments: agent-mesos-docker-logs.tar.xz, 
> masternode-1-mesos-marathon-log.tar.xz, 
> masternode-3-mesos-marathon-log.tar.xz, mesos-nodes.png
>
>
> After many create/delete cycles of applications with Docker instances through 
> the Marathon API, I have a lot of lost tasks after the final *deletion of all 
> applications in Marathon*.
> They fall into three types:
> 1. Tasks hang in STAGED status. I don't see these tasks in 'docker ps' on the 
> slave, and _service docker restart_ on the Mesos slave did not fix these tasks.
> 2. RUNNING because docker hangs and can't delete these instances (a lot of 
> {code}
> Killing docker task
> Shutting down
> Killing docker task
> Shutting down
> {code}
>  in stdout). 
> _docker stop ID_ hangs, and these tasks can be fixed by _service docker 
> restart_ on the Mesos slave.
> 3. RUNNING after _service docker restart_ on the Mesos slave.
> Screenshot attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
