[ https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981210#comment-14981210 ]
Jord Sonneveld commented on MESOS-3706:
---------------------------------------
{noformat}
jord@dalstgmesos03:~$ ps auwx | grep 48116
root 48116 0.0 0.0 1063184 22344 ? Ssl Oct16 0:00
mesos-docker-executor
--container=mesos-aa2e5fac-6f6b-4943-ad32-edaeb3fb51a1-S2.c2312296-3de2-4ee1-9693-bde232371462
--docker=docker --docker_socket=/var/run/docker.sock --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/data/mesos/mesos/work/slaves/aa2e5fac-6f6b-4943-ad32-edaeb3fb51a1-S2/frameworks/aa2e5fac-6f6b-4943-ad32-edaeb3fb51a1-0000/executors/kwe-vinland.02d09d14-745b-11e5-9566-005056b0582f/runs/c2312296-3de2-4ee1-9693-bde232371462
--stop_timeout=0ns
jord@dalstgmesos03:~$ sudo strace -p 48116
Process 48116 attached
futex(0x17a8dcc, FUTEX_WAIT_PRIVATE, 1, NULL
^CProcess 48116 detached
<detached ...>
jord@dalstgmesos03:~$
{noformat}
I let the strace run for about an hour and then stopped it.
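For what it's worth, mesos-docker-executor is a multi-threaded libprocess binary that shells out to the docker CLI, so attaching strace to the main PID alone only shows that one thread parked on a futex, which is not conclusive by itself. A suggested follow-up (these commands are against PID 48116 as an illustration, not output captured from the affected host) is to follow all of its threads and look for a lingering docker child:
{noformat}
# Attach to every thread/child of the executor, not just the main thread
sudo strace -f -p 48116

# Does the executor still have a docker CLI child process hanging around?
pstree -p 48116

# Kernel stack of each thread, to see what it is blocked on
sudo cat /proc/48116/task/*/stack
{noformat}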
> Tasks stuck in staging.
> -----------------------
>
> Key: MESOS-3706
> URL: https://issues.apache.org/jira/browse/MESOS-3706
> Project: Mesos
> Issue Type: Bug
> Components: docker, slave
> Affects Versions: 0.23.0, 0.24.1
> Reporter: Jord Sonneveld
> Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot
> 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO,
> mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
>
> I have a Docker image that starts fine on all my slaves except for one. On
> that one, it is stuck in STAGING for a long time and never starts. The INFO
> log is full of messages like this:
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task
> kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework
> 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12:
> Transport endpoint is not connected [107]
> kwe-vinland-work is the task that is stuck in STAGING. It is launched by
> Marathon. I have launched 161 instances successfully on my cluster, but it
> refuses to launch on this specific slave.
> These machines are all managed via Ansible, so their configurations are (or
> at least should be) identical. I have re-run my Ansible scripts and rebooted
> the machines to no avail.
> The task has been in this state for almost 30 minutes, and you can see the
> mesos-docker-executor process is still running:
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root 35360 0.0 0.0 1070576 21476 ? Ssl 15:46 0:00
> mesos-docker-executor
> --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe
> --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe
> --stop_timeout=0ns
> According to docker ps -a, nothing was ever even launched:
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER ID   IMAGE                                               COMMAND                  CREATED          STATUS          PORTS                                            NAMES
> 5c858b90b0a0   registry.roger.dal.moz.com:5000/moz-statsd-v0.22    "/bin/sh -c ./start.s"   39 minutes ago   Up 39 minutes   0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp   statsd-fe-influxdb
> d765ba3829fd   registry.roger.dal.moz.com:5000/moz-statsd-v0.22    "/bin/sh -c ./start.s"   41 minutes ago   Up 41 minutes   0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp   statsd-repeater
> Those are the only two entries. Nothing about the kwe-vinland job.
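If it helps narrow this down further, the Docker daemon's own event stream and log on dalstgmesos03 should show whether the executor's run request ever reached the daemon at all. A rough sketch of that check (the log path is an assumption about the host's init system):
{noformat}
# Leave this streaming while Marathon relaunches the task; a healthy launch shows
# create/start events for the mesos-<container-id> name, a wedged one shows nothing
sudo docker events

# Daemon log; this path assumes an upstart host, use `journalctl -u docker` on systemd
sudo tail -n 200 /var/log/upstart/docker.log
{noformat}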
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)