[
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593904#comment-14593904
]
Chris Fortier commented on MESOS-2419:
--------------------------------------
That would be fantastic!
Here's the systemd unit and the associated logs:
Systemd unit:
```
[Unit]
Description=MesosSlave
After=docker.service dockercfg.service
Requires=docker.service dockercfg.service
[Service]
Environment=MESOS_IMAGE=mesosphere/mesos-slave:0.22.1-1.0.ubuntu1404
Environment=ZOOKEEPER=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181
User=core
KillMode=process
Restart=on-failure
RestartSec=20
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill mesos_slave
ExecStartPre=-/usr/bin/docker rm mesos_slave
ExecStartPre=/usr/bin/docker pull ${MESOS_IMAGE}
ExecStart=/usr/bin/sh -c "sudo /usr/bin/docker run \
--name=mesos_slave \
--net=host \
--privileged \
-v /home/core/.dockercfg:/root/.dockercfg:ro \
-v /sys:/sys \
-v /usr/bin/docker:/usr/bin/docker:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
-v /var/lib/mesos/slave:/var/lib/mesos/slave \
${MESOS_IMAGE} \
--ip=$(/usr/bin/ip -o -4 addr list eth0 | grep global | awk \'{print $4}\' | cut -d/ -f1) \
--attributes=zone:$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)\;os:coreos \
--containerizers=docker \
--executor_registration_timeout=10mins \
--hostname=`curl -s http://169.254.169.254/latest/meta-data/public-hostname` \
--isolation=cgroups/cpu,cgroups/mem \
--log_dir=/var/log/mesos \
--master=zk://${ZOOKEEPER}/mesos \
--work_dir=/var/lib/mesos/slave"
ExecStop=/usr/bin/docker stop mesos_slave
ExecStartPost=/usr/bin/docker pull behance/utility:latest
ExecStartPost=/usr/bin/docker pull ubuntu:14.04
ExecStartPost=/usr/bin/docker pull debian:jessie
[Install]
WantedBy=multi-user.target
[X-Fleet]
Global=true
MachineMetadata=role=worker
```
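For readability, here are the command substitutions that ExecStart relies on, pulled out as standalone commands (assuming an EC2/CoreOS host where the primary interface is eth0 and the metadata service at 169.254.169.254 is reachable):
```
# Primary IPv4 address of eth0 -- the value substituted into --ip=
/usr/bin/ip -o -4 addr list eth0 | grep global | awk '{print $4}' | cut -d/ -f1

# Availability zone from the EC2 metadata service -- used for --attributes=zone:...
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone

# Public hostname from the EC2 metadata service -- used for --hostname=
curl -s http://169.254.169.254/latest/meta-data/public-hostname
```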
Logs:
```
fortier@ip-10-43-3-126 ~ $ docker logs 7cd21326a98c
I0619 17:57:22.075104 15406 logging.cpp:172] INFO level logging started!
I0619 17:57:22.075305 15406 main.cpp:156] Build: 2015-05-05 06:15:50 by root
I0619 17:57:22.075314 15406 main.cpp:158] Version: 0.22.1
I0619 17:57:22.075319 15406 main.cpp:161] Git tag: 0.22.1
I0619 17:57:22.075322 15406 main.cpp:165] Git SHA:
d6309f92a7f9af3ab61a878403e3d9c284ea87e0
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@712: Client
environment:zookeeper.version=zookeeper C client 3.4.5
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@716: Client
environment:host.name=ip-10-43-3-126.us-west-2.compute.internal
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@723: Client
environment:os.name=Linux
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@724: Client
environment:os.arch=4.0.5
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@725: Client
environment:os.version=#2 SMP Thu Jun 18 08:53:45 UTC 2015
I0619 17:57:22.177387 15406 main.cpp:200] Starting Mesos slave
2015-06-19 17:57:22,177:15406(0x7f918ec5d700):ZOO_INFO@log_env@733: Client
environment:user.name=(null)
I0619 17:57:22.178097 15406 slave.cpp:174] Slave started on 1)@10.43.3.126:5051
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@741: Client
environment:user.home=/root
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@log_env@753: Client
environment:user.dir=/
2015-06-19 17:57:22,178:15406(0x7f918ec5d700):ZOO_INFO@zookeeper_init@786:
Initiating client connection,
host=internal-portfolio-Internal-1GZCG4XPMN2WS-969465822.us-west-2.elb.amazonaws.com:2181
sessionTimeout=10000 watcher=0x7f91936bfa60 sessionId=0 sessionPasswd=<null>
context=0x7f9178001010 flags=0
I0619 17:57:22.178235 15406 slave.cpp:322] Slave resources: cpus(*):8;
mem(*):14019; disk(*):42121; ports(*):[31000-32000]
I0619 17:57:22.178401 15406 slave.cpp:351] Slave hostname:
ec2-52-24-66-221.us-west-2.compute.amazonaws.com
I0619 17:57:22.178427 15406 slave.cpp:352] Slave checkpoint: true
I0619 17:57:22.179797 15415 state.cpp:35] Recovering state from
'/var/lib/mesos/slave/meta'
I0619 17:57:22.180737 15417 slave.cpp:3890] Recovering framework
20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.180768 15417 slave.cpp:4319] Recovering executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.181002 15412 status_update_manager.cpp:197] Recovering status
update manager
I0619 17:57:22.181032 15412 status_update_manager.cpp:205] Recovering executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.181354 15414 docker.cpp:423] Recovering Docker containers
2015-06-19 17:57:22,188:15406(0x7f918affa700):ZOO_INFO@check_events@1703:
initiated connection to server [10.43.103.108:2181]
2015-06-19 17:57:22,195:15406(0x7f918affa700):ZOO_INFO@check_events@1750:
session establishment complete on server [10.43.103.108:2181],
sessionId=0x14e0883caef002a, negotiated timeout=10000
I0619 17:57:22.195842 15410 group.cpp:313] Group process
(group(1)@10.43.3.126:5051) connected to ZooKeeper
I0619 17:57:22.195871 15410 group.cpp:790] Syncing group operations: queue size
(joins, cancels, datas) = (0, 0, 0)
I0619 17:57:22.195883 15410 group.cpp:385] Trying to create path '/mesos' in
ZooKeeper
I0619 17:57:22.199296 15410 detector.cpp:138] Detected a new leader: (id='23')
I0619 17:57:22.199357 15410 group.cpp:659] Trying to get
'/mesos/info_0000000023' in ZooKeeper
I0619 17:57:22.201074 15410 detector.cpp:452] A new leading master
([email protected]:5050) is detected
I0619 17:57:22.380323 15417 docker.cpp:516] Recovering container
'ac11c0c6-b753-4bbd-8276-427de0a92f62' for executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:22.381031 15415 slave.cpp:3749] Sending reconnect request to
executor
portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe
of framework 20150612-153240-4144114442-5050-1-0000 at
executor(1)@10.43.3.126:57513
I0619 17:57:24.382136 15410 slave.cpp:2480] Cleaning up un-reregistered
executors
I0619 17:57:24.382182 15410 slave.cpp:2498] Killing un-reregistered executor
'portfolio-reynard-behance--pro2-reynard---e93109da6e527fe95c203885298d73d40ed9a5aa.7dd074ca-16ab-11e5-9ca3-7a8174cf00fe'
of framework 20150612-153240-4144114442-5050-1-0000
I0619 17:57:24.382235 15410 slave.cpp:3808] Finished recovery
I0619 17:57:24.382254 15411 docker.cpp:1212] Destroying container
'ac11c0c6-b753-4bbd-8276-427de0a92f62'
I0619 17:57:24.382295 15411 docker.cpp:1319] Running docker stop on container
'ac11c0c6-b753-4bbd-8276-427de0a92f62'
```
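For what it's worth, the timeline above matches the slave's executor re-registration window: the reconnect request goes out at 17:57:22.38 and "Cleaning up un-reregistered executors" follows at 17:57:24.38, i.e. the executor did not re-register within the (default, two-second) `--executor_reregistration_timeout`, after which the slave kills it and destroys its container. A quick way to check the flags the slave is actually running with (the address below is the one reported in the log; state.json includes a "flags" section):
```
# Dump the slave's state, including its effective flags, from the address
# shown in the log ("Slave started on 1)@10.43.3.126:5051").
curl -s http://10.43.3.126:5051/state.json
```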
> Slave recovery not recovering tasks when using systemd
> ------------------------------------------------------
>
> Key: MESOS-2419
> URL: https://issues.apache.org/jira/browse/MESOS-2419
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Reporter: Brenden Matthews
> Assignee: Joerg Schad
> Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> {color:red}
> Note: the resolution to this issue is described in the following comment:
> https://issues.apache.org/jira/browse/MESOS-2419?focusedCommentId=14357028&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357028
> {color}
> In a recent build from master (updated yesterday), slave recovery appears to
> have broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called
> `long-running-job` which is a Chronos job that just does `sleep 1h`. After
> restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.247207 10022 docker.cpp:468] Recovering container
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254791 10022 docker.cpp:1333] Executor for container
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254812 10022 docker.cpp:1159] Destroying container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.262565 10027 containerizer.cpp:353] Recovering container
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup
> for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container
> f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:49.266466 10022 containerizer.cpp:938] Destroying container
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.597843 10024 slave.cpp:3175] Termination of executor
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework
> '20150226-230228-2931198986-5050-717-0000' failed: Container
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for
> executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227
> 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for
> container f2001064-e076-4978-b764-ed12a5244e78 of executor
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal
> task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78'
> not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:50.599148 10024 composing.cpp:513] Container
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update
> TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for
> status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for
> task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227
> 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework
> 20150226-230228-2931198986-5050-717-0000 sent by status update manager
> because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227
> 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}
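For reference, a minimal sketch of the kind of Chronos job the description mentions (`long-running-job` running `sleep 1h`). Only the job name and command come from the ticket; the endpoint, schedule, epsilon, and owner are placeholders, assuming the Chronos 2.x REST API. Restart the slave while the task is running to reproduce:
```
# Hypothetical repro job -- submit to Chronos, then restart the slave and
# watch whether the task survives recovery or flips to TASK_FAILED.
curl -s -X POST -H 'Content-Type: application/json' \
  http://chronos.example.com:4400/scheduler/iso8601 \
  -d '{
        "name": "long-running-job",
        "command": "sleep 1h",
        "schedule": "R1//PT1H",
        "epsilon": "PT30M",
        "owner": "[email protected]"
      }'
```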
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)