[
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593904#comment-14593904
]
Chris Fortier commented on MESOS-2419:
--------------------------------------
That would be fantastic!
Here's the systemd unit and the associated logs:
Systemd unit:
```
[Unit]
Description=MesosSlave
After=docker.service dockercfg.service
Requires=docker.service dockercfg.service
[Service]
Environment=MESOS_IMAGE=mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404
Environment=ZOOKEEPER=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181
User=core
KillMode=process
Restart=on-failure
RestartSec=20
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill mesos_slave
ExecStartPre=-/usr/bin/docker rm mesos_slave
ExecStartPre=/usr/bin/docker pull ${MESOS_IMAGE}
ExecStart=/usr/bin/sh -c "sudo /usr/bin/docker run \
--name=mesos_slave \
--net=host \
--privileged \
-v /home/core/.dockercfg:/root/.dockercfg:ro \
-v /sys:/sys \
-v /usr/bin/docker:/usr/bin/docker:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
-v /var/lib/mesos/slave:/var/lib/mesos/slave \
${MESOS_IMAGE} \
--ip=$(/usr/bin/ip -o -4 addr list eth0 | grep global | awk \'{print $4}\' | cut -d/ -f1) \
--attributes=zone:$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)\;os:coreos \
--containerizers=docker \
--executor_registration_timeout=10mins \
--hostname=`curl -s http://169.254.169.254/latest/meta-data/public-hostname` \
--isolation=cgroups/cpu,cgroups/mem \
--log_dir=/var/log/mesos \
--master=zk://${ZOOKEEPER}/mesos \
--work_dir=/var/lib/mesos/slave"
ExecStop=/usr/bin/docker stop mesos_slave
ExecStartPost=/usr/bin/docker pull behance/utility:latest
ExecStartPost=/usr/bin/docker pull ubuntu:14.04
ExecStartPost=/usr/bin/docker pull debian:jessie
[Install]
WantedBy=multi-user.target
[X-Fleet]
Global=true
MachineMetadata=role=worker
```
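For readability, here are the command substitutions that ExecStart relies on, pulled out as standalone commands (assuming an EC2/CoreOS host where the primary interface is eth0 and the metadata service at 169.254.169.254 is reachable):
```
# Primary IPv4 address of eth0 -- the value substituted into --ip=
/usr/bin/ip -o -4 addr list eth0 | grep global | awk '{print $4}' | cut -d/ -f1

# Availability zone from the EC2 metadata service -- used for --attributes=zone:...
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone

# Public hostname from the EC2 metadata service -- used for --hostname=
curl -s http://169.254.169.254/latest/meta-data/public-hostname
```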
Logs:
```
fortier@ip-10-43-3-126 ~ $ docker logs 7cd21326a98c
I0619 17:57:22.075104 15406 logging.cpp:172] INFO level logging started!
I0619 17:57:22.075305 15406 main.cpp:156] Build: 2015-05-05 06:15:50 by root
I0619 17:57:22.075314 15406 main.cpp:158] Version: 0.22.1
I0619 17:57:22.075319 15406 main.cpp:161] Git tag: 0.22.1
I0619 17:57:22.075322 15406 main.cpp:165] Git SHA:
d6309f92a7f9af3ab61a878403e3d9c284ea87e0
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@712: Client
environment:zookeeper.version=zookeeper C client 3.4.5
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@716: Client
environment:host.name=ip-10-43-3-126.us-west-2.compute.internal
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@723: Client
environment:os.name=Linux
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@724: Client
environment:os.arch=4.0.5
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@725: Client
environment:os.version=#2 SMP Thu Jun 18 08:53:45 UTC 2015
I0619 17:57:22.177387 15406 main.cpp:200] Starting Mesos slave
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@733: Client
environment:user.name=(null)
I0619 17:57:22.178097 15406 slave.cpp:174] Slave started on 1)@10.43.3.126:5051
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@741: Client
environment:user.home=/root
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@753: Client
environment:user.dir=/
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@zookeeper_init@786:
Initiating client connection,
host=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181
sessionTimeout=10000 watcher=0x7f91936bfa60 sessionId=0 sessionPasswd=<null>
context=0x7f9178001010 flags=0
I0619 17:57:22.178235 15406 slave.cpp:322] Slave resources: cpus(*):8;
mem(*):14019; disk(*):42121; ports(*):[31000-32000]
I0619 17:57:22.178401 15406 slave.cpp:351] Slave hostname:
ec2-52-24-66-221.us-west-2.compute.amazonaws.com
I0619 17:57:22.178427 15406 slave.cpp:352] Slave checkpoint: true
I0619 17:57:22.179797 15415 state.cpp:35] Recovering state from
'/var/lib/mesos/slave/meta'
I0619 17:57:22.180737 15417 slave.cpp:3890] Recovering framework
20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.180768 15417 slave.cpp:4319] Recovering executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.181002 15412 status_update_manager.cpp:197] Recovering status
update manager
I0619 17:57:22.181032 15412 status_update_manager.cpp:205] Recovering executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.181354 15414 docker.cpp:423] Recovering Docker containers
2015-06-19 17:57:22,188:15406(0x7f918affa700):ZOO_INFO@check_events@1703:
initiated connection to server [10.43.103.108:2181]
2015-06-19 17:57:22,195:15406(0x7f918affa700):ZOO_INFO@check_events@1750:
session establishment complete on server [10.43.103.108:2181],
sessionId=0x14e0883caef002a, negotiated timeout=10000
I0619 17:57:22.195842 15410 group.cpp:313] Group process
(group(1)@10.43.3.126:5051) connected to ZooKeeper
I0619 17:57:22.195871 15410 group.cpp:790] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0619 17:57:22.195883 15410 group.cpp:385] Trying to create path '/mesos' in
ZooKeeper
I0619 17:57:22.199296 15410 detector.cpp:138] Detected a new leader: (id='23')
I0619 17:57:22.199357 15410 group.cpp:659] Trying to get
'/mesos/info_0000000023' in ZooKeeper
I0619 17:57:22.201074 15410 detector.cpp:452] A new leading master
([email protected]:5050) is detected
I0619 17:57:22.380323 15417 docker.cpp:516] Recovering container
'ac11c0c6-b753-4bbd-8276-427de0a92f62' for executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.381031 15415 slave.cpp:3749] Sending reconnect request to
executor
portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe
of framework 20150612-153240-4144114442-5050-1-0000 at
executor(1)@10.43.3.126:57513
I0619 17:57:24.382136 15410 slave.cpp:2480] Cleaning up un-reregistered
executors
I0619 17:57:24.382182 15410 slave.cpp:2498] Killing un-reregistered executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:24.382235 15410 slave.cpp:3808] Finished recovery
I0619 17:57:24.382254 15411 docker.cpp:1212] Destroying container
'ac11c0c6-b753-4bbd-8276-427de0a92f62'
I0619 17:57:24.382295 15411 docker.cpp:1319] Running docker stop on container
'ac11c0c6-b753-4bbd-8276-427de0a92f62'
```
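For what it's worth, the timeline above matches the slave's executor re-registration window: the reconnect request goes out at 17:57:22.38 and "Cleaning up un-reregistered executors" follows at 17:57:24.38, i.e. the executor did not re-register within the (default, two-second) `--executor_reregistration_timeout`, after which the slave kills it and destroys its container. A quick way to check the flags the slave is actually running with (the address below is the one reported in the log; state.json includes a "flags" section):
```
# Dump the slave's state, including its effective flags, from the address
# shown in the log ("Slave started on 1)@10.43.3.126:5051").
curl -s http://10.43.3.126:5051/state.json
```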
> Slave recovery not recovering tasks when using systemd
> ------------------------------------------------------
>
> Key: MESOS-2419
> URL: https://issues.apache.org/jira/browse/MESOS-2419
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Brenden Matthews
> Assignee: Joerg Schad
> Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> {color:red}
> Note: the resolution to this issue is described in the following comment:
> https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
> {color}
> In a recent build from master (updated yesterday), slave recovery appears to
> have broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called
> `long-running-job` which is a Chronos job that just does `sleep 1h`. After
> restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.247207 10022 docker.cpp:468] Recovering container
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254791 10022 docker.cpp:1333] Executor for container
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254812 10022 docker.cpp:1159] Destroying container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.262565 10027 containerizer.cpp:353] Recovering container
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup
> for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container
> f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.266466 10022 containerizer.cpp:938] Destroying container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.597843 10024 slave.cpp:3175] Termination of executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> '20150226-230228-2931198986-5050-717-0000' failed: Container
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for
> executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for
> container f2001064-e076-4978-b764-ed12a5244e78 of executor
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal
> task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78'
> not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:50.599148 10024 composing.cpp:513] Container
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update
> TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for
> status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for
> task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 sent by status update manager
> because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}
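For reference, a minimal sketch of the kind of Chronos job the description mentions (`long-running-job` running `sleep 1h`). Only the job name and command come from the ticket; the endpoint, schedule, epsilon, and owner are placeholders, assuming the Chronos 2.x REST API. Restart the slave while the task is running to reproduce:
```
# Hypothetical repro job -- submit to Chronos, then restart the slave and
# watch whether the task survives recovery or flips to TASK_FAILED.
curl -s -X POST -H 'Content-Type: application/json' \
  http://chronos.example.com:4400/scheduler/iso8601 \
  -d '{
        "name": "long-running-job",
        "command": "sleep 1h",
        "schedule": "R1//PT1H",
        "epsilon": "PT30M",
        "owner": "[email protected]"
      }'
```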
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)