[
https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115448#comment-15115448
]
doubao commented on MESOS-3706:
-------------------------------
Hi all,
I'm glad to see this issue is still getting replies.
I have some thoughts on the "staging" status. I have not had the time or
the means to test them, so the following is only my opinion; please forgive
any mistakes.
First, my environment is similar to the one in this issue (Mesos 0.26), and
I also see tasks stuck in "staging" status.
I think Docker may be the cause.
On one occasion, my colleague found that after he started a Docker container
that failed, the container was deleted and 'docker ps' showed no information
about it. But when he ran 'cd' into the container's data directory (Docker's
data), he found that the container's directory was still there. I know that
this Docker installation uses the default storage driver, device-mapper,
without an LVM volume; it uses the mounted directory instead. So I think this
may be the problem, because the Docker documentation says not to use this
mode in production.
(https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/)
So I think 'staging' has two possible causes: a faulty Docker container (for
many possible reasons), and the Docker storage driver.
These are just my initial thoughts, and I cannot test them yet. Does anyone
have any ideas?
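To look for the leftover container directories described above, a minimal
sketch like the following might help. It compares the container IDs that
'docker ps -a' reports with the directory names under Docker's data
directory. The path /var/lib/docker/containers and the helper names are
assumptions for illustration, not something confirmed in this issue; adjust
the path to wherever your Docker keeps its data.

```python
import os
import subprocess

# Assumption: default Docker data root; the real location may differ.
DOCKER_CONTAINERS_DIR = "/var/lib/docker/containers"


def find_orphans(known_ids, dir_names):
    """Return directory names that match no container ID known to Docker."""
    known = set(known_ids)
    return sorted(name for name in dir_names if name not in known)


def docker_container_ids():
    """List full (untruncated) IDs of all containers, running or stopped."""
    out = subprocess.check_output(["docker", "ps", "-a", "-q", "--no-trunc"])
    return out.decode().split()


def main():
    # Needs permission to talk to Docker and to read its data directory.
    ids = docker_container_ids()
    dirs = os.listdir(DOCKER_CONTAINERS_DIR)
    for name in find_orphans(ids, dirs):
        print("directory without a container:", name)
```

Calling main() on the affected slave (as root) would print any directory
that Docker no longer lists as a container. An empty result would not rule
out the storage driver as a cause; it only checks this one symptom.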
To [~travis.hegner]:
I read your reply and would like to add my view. I think there may be a
problem with how Mesos gets information from Docker. If the Docker container
has an error, the Mesos executor deletes it, and at that point the Docker
storage driver may cause a problem: 'docker ps' looks fine, but the
container's directory still exists on the filesystem. That would leave the
task in staging status. I think your problem may have the same cause. Sorry
if I have not explained this clearly.
Since I switched my Docker environment to the LVM storage driver, the
staging status has occurred less frequently.
Because of Chinese New Year, I cannot verify my ideas right now. I will
follow up with experiments.
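For anyone who wants to check whether a slave is running device-mapper on
the mounted directory rather than on LVM, here is a rough sketch that
inspects the text printed by 'docker info'. It assumes that the output
contains a "Storage Driver" line and, in the loopback setup, a "Data loop
file" line; that matches what I understand of Docker's output but may vary
between versions. The function name is made up for this example.

```python
def uses_loopback_devicemapper(docker_info_text):
    """Guess from 'docker info' output whether devicemapper uses loop files."""
    driver = None
    has_loop_file = False
    for line in docker_info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line
        key = key.strip()
        if key == "Storage Driver":
            driver = value.strip()
        elif key == "Data loop file":
            # Assumption: this line only appears in the loopback setup.
            has_loop_file = True
    return driver == "devicemapper" and has_loop_file
```

The text could come from something like
subprocess.check_output(["docker", "info"]).decode() on each slave, which
would make it easy to compare the slaves that get stuck in staging with the
ones that do not.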
> Tasks stuck in staging.
> -----------------------
>
> Key: MESOS-3706
> URL: https://issues.apache.org/jira/browse/MESOS-3706
> Project: Mesos
> Issue Type: Bug
> Components: docker, slave
> Affects Versions: 0.23.0, 0.24.1
> Reporter: Jord Sonneveld
> Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot
> 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO,
> mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
>
> I have a docker image which starts fine on all my slaves except for one. On
> that one, it is stuck in STAGING for a long time and never starts. The INFO
> log is full of messages like this:
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task
> kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework
> 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12:
> Transport endpoint is not connected [107]
> kwe-vinland-work is the task that is stuck in staging. It is launched by
> marathon. I have launched 161 instances successfully on my cluster. But it
> refuses to launch on this specific slave.
> These machines are all managed via ansible so their configurations are /
> should be identical. I have re-run my ansible scripts and rebooted the
> machines to no avail.
> It's been in this state for almost 30 minutes. You can see the mesos docker
> executor is still running:
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root 35360 0.0 0.0 1070576 21476 ? Ssl 15:46 0:00
> mesos-docker-executor
> --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe
> --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe
> --stop_timeout=0ns
> According to docker ps -a, nothing was ever even launched:
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER ID IMAGE
> COMMAND CREATED STATUS PORTS
> NAMES
> 5c858b90b0a0 registry.roger.dal.moz.com:5000/moz-statsd-v0.22
> "/bin/sh -c ./start.s" 39 minutes ago Up 39 minutes
> 0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp statsd-fe-influxdb
> d765ba3829fd registry.roger.dal.moz.com:5000/moz-statsd-v0.22
> "/bin/sh -c ./start.s" 41 minutes ago Up 41 minutes
> 0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp statsd-repeater
> Those are the only two entries. Nothing about the kwe-vinland job.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)