[ https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543439#comment-14543439 ]
Owen Smith edited comment on MESOS-2276 at 5/14/15 9:51 AM:
------------------------------------------------------------
I've experienced this same issue, and debugging it / figuring out a course of
action was pretty tough.
bq. (we have an app which crashes on startup right now, retrying to restart
every few seconds)
Yup, that was the trigger situation for us too. When using frameworks like
Marathon, it's pretty easy for someone to accidentally create a situation like
this while developing.
For others' benefit: it's not always _just_ mesos at fault here. With enough
dead containers, there can be additional complications from docker itself. For
example, [~sivaramsk], I think I saw the same thing you did (although
regrettably I didn't check the lsof counts). We ended up attributing it to our
use of devicemapper as the docker storage driver, based on some nasty
docker+devicemapper issues we'd seen previously. We had to restart the
affected machines :-/ (and switched to aufs for docker while we were at it).
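In case it helps the next person, here's a rough sketch of the kind of checks
and cleanup we'd reach for now. It assumes the stock docker CLI and GNU
userland, and that the slave can be found with pgrep by process name; adjust
to your setup.
{code}
# How many fds the slave process is holding (we regrettably never captured this)
lsof -p "$(pgrep -o -f mesos-slave)" | wc -l

# How many stopped containers have piled up on the box
docker ps -aq --filter status=exited | wc -l

# Remove them, if you're sure nothing still needs their logs or volumes
docker ps -aq --filter status=exited | xargs --no-run-if-empty docker rm
{code}
This only trims the stopped containers; it won't recover a machine where
devicemapper itself is already wedged, which is where we ended up rebooting.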
> Mesos-slave refuses to startup with many stopped docker containers
> ------------------------------------------------------------------
>
> Key: MESOS-2276
> URL: https://issues.apache.org/jira/browse/MESOS-2276
> Project: Mesos
> Issue Type: Bug
> Components: docker, slave
> Affects Versions: 0.21.0, 0.21.1
> Environment: Ubuntu 14.04LTS, Mesosphere packages
> Reporter: Dr. Stefan Schimanski
>
> The mesos-slave is launched as
> # /usr/local/sbin/mesos-slave
> --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2
> --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint
> --containerizers=docker --executor_registration_timeout=5mins
> --logging_level=INFO
> giving this output:
> I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
> I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
> I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
> I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
> I0127 19:26:32.674824 19880 main.cpp:151] Git SHA:
> ab8fa655d34e8e15a4290422df38a18db1c09b5b
> I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client
> environment:host.name=srv002
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client
> environment:os.name=Linux
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client
> environment:os.arch=3.13.0-44-generic
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client
> environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client
> environment:user.name=root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client
> environment:user.home=/root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client
> environment:user.dir=/root
> 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786:
> Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
> sessionTimeout=10000 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd=<null>
> context=0x7fceec0009e0 flags=0
> I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
> I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8;
> mem(*):6960; disk(*):246731; ports(*):[31000-32000]
> I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
> I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
> 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703:
> initiated connection to server [10.0.0.1:2181]
> I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from
> '/tmp/mesos/meta'
> I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status
> update manager
> I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
> 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750:
> session establishment complete on server [10.0.0.1:2181],
> sessionId=0x14b2adf7a560106, negotiated timeout=10000
> I0127 19:26:32.823292 19885 group.cpp:313] Group process
> (group(1)@10.0.0.2:5051) connected to ZooKeeper
> I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in
> ZooKeeper
> I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader:
> (id='143')
> I0127 19:26:32.830559 19882 group.cpp:659] Trying to get
> '/mesos/info_0000000143' in ZooKeeper
> I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master
> ([email protected]:5050) is detected
> Failed to perform recovery: Collect failed: Failed to create pipe: Too many
> open files
> To remedy this do as follows:
> Step 1: rm -f /tmp/mesos/meta/slaves/latest
> This ensures slave doesn't recover old live executors.
> Step 2: Restart the slave.
> There is nothing at /tmp/mesos/meta/slaves/latest.
> The slave was part of a 3-node cluster before.
> When started as an upstart service, the process is relaunched over and over,
> and a large number of defunct processes appear, like these:
> root 30321 0.0 0.0 13000 440 ? S 19:28 0:00 iptables
> --wait -L -n
> root 30322 0.0 0.0 4444 396 ? S 19:28 0:00 sh -c docker
> inspect mesos-e1f538b4-993a-4cd4-99b0-d633c5e9dd55
> root 30328 0.0 0.0 0 0 ? Z 19:28 0:00 [sh]
> <defunct>
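A sketch of one way to confirm the fd exhaustion behind "Too many open files"
and buy some headroom before cleaning up. It assumes the Mesosphere packages
install the upstart job at /etc/init/mesos-slave.conf; verify the path and
pick limit values appropriate for your environment.
{code}
# Check the open-file limit the (crash-looping) slave actually runs under
cat /proc/$(pgrep -o -f mesos-slave)/limits | grep 'open files'

# Raise the limit for the upstart job, then stop/start it
# (upstart does not apply .conf changes on a plain 'restart')
echo 'limit nofile 65536 65536' >> /etc/init/mesos-slave.conf
stop mesos-slave || true
start mesos-slave
{code}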