[jira] [Comment Edited] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers

Arne Visscher (JIRA) Thu, 03 May 2018 06:42:31 -0700

    [ https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462455#comment-16462455 ]


Arne Visscher edited comment on MESOS-2276 at 5/3/18 1:41 PM:
--------------------------------------------------------------

I have also encountered this issue: a dead container ended up in a mesos-slave 
AMI, and the slave then failed to register with the cluster. (It kept 
restarting, and registered almost immediately upon removal of the dead 
container.)


was (Author: kiwivogel):
I have also encountered this issue: a dead container ended up in a mesos-slave 
AMI, and the slave then failed to register with the cluster. (It kept 
restarting, and registered almost immediately upon removal of the dead 
container.)

> Mesos-slave refuses to startup with many stopped docker containers
> ------------------------------------------------------------------
>
>                 Key: MESOS-2276
>                 URL: https://issues.apache.org/jira/browse/MESOS-2276
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, docker
>    Affects Versions: 0.21.0, 0.21.1
>         Environment: Ubuntu 14.04LTS, Mesosphere packages
>            Reporter: Dr. Stefan Schimanski
>            Priority: Major
>
> The mesos-slave is launched as
> # /usr/local/sbin/mesos-slave 
> --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 
> --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint 
> --containerizers=docker --executor_registration_timeout=5mins 
> --logging_level=INFO
> giving this output:
> I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
> I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
> I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
> I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
> I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: 
> ab8fa655d34e8e15a4290422df38a18db1c09b5b
> I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client 
> environment:host.name=srv002
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client 
> environment:os.arch=3.13.0-44-generic
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client 
> environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client 
> environment:user.name=root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client 
> environment:user.home=/root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/root
> 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: 
> Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 
> sessionTimeout=10000 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd=<null> 
> context=0x7fceec0009e0 flags=0
> I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
> I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; 
> mem(*):6960; disk(*):246731; ports(*):[31000-32000]
> I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
> I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
> 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: 
> initiated connection to server [10.0.0.1:2181]
> I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from 
> '/tmp/mesos/meta'
> I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status 
> update manager
> I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
> 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: 
> session establishment complete on server [10.0.0.1:2181], 
> sessionId=0x14b2adf7a560106, negotiated timeout=10000
> I0127 19:26:32.823292 19885 group.cpp:313] Group process 
> (group(1)@10.0.0.2:5051) connected to ZooKeeper
> I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in 
> ZooKeeper
> I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: 
> (id='143')
> I0127 19:26:32.830559 19882 group.cpp:659] Trying to get 
> '/mesos/info_0000000143' in ZooKeeper
> I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master 
> (UPID=master@10.0.0.1:5050) is detected
> Failed to perform recovery: Collect failed: Failed to create pipe: Too many 
> open files
> To remedy this do as follows:
> Step 1: rm -f /tmp/mesos/meta/slaves/latest
>         This ensures slave doesn't recover old live executors.
> Step 2: Restart the slave.
> At /tmp/mesos/meta/slaves/latest there is nothing.
> The slave was part of a 3 node cluster before.
> When started as an upstart service, the process is relaunched all the time 
> and a large number of defunct processes appear, like these ones:
> root     30321  0.0  0.0  13000   440 ?        S    19:28   0:00 iptables 
> --wait -L -n
> root     30322  0.0  0.0   4444   396 ?        S    19:28   0:00 sh -c docker 
> inspect mesos-e1f538b4-993a-4cd4-99b0-d633c5e9dd55
> root     30328  0.0  0.0      0     0 ?        Z    19:28   0:00 [sh] 
> <defunct>
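
The log above suggests that recovery runs one `docker inspect` per known
container and fails while creating pipes once the open-file limit is reached.
The following is a minimal Python sketch of the general idea of bounding how
many such subprocesses (and their pipes) are open at once. It is not the Mesos
implementation; `run_bounded`, `inspect_commands`, and the `max_parallel`
default are hypothetical names and values chosen for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_bounded(commands, max_parallel=8):
    """Run each command, keeping at most max_parallel subprocesses
    (and therefore their stdout/stderr pipes) open at any one time.

    Returns a list of (returncode, stdout) tuples in input order.
    """
    def run_one(cmd):
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        return done.returncode, done.stdout

    # The pool size caps concurrency; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


def inspect_commands(container_names):
    """Build one `docker inspect <name>` argv per container name."""
    return [["docker", "inspect", name] for name in container_names]


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without Docker installed.
    demo = [[sys.executable, "-c", "print(%d)" % i] for i in range(20)]
    for returncode, output in run_bounded(demo, max_parallel=4):
        print(returncode, output.strip())
```

With a real Docker host, `run_bounded(inspect_commands(names))` would inspect
the named containers a few at a time, so the number of open pipes no longer
grows with the number of stopped containers.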



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
