[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357028#comment-14357028 ]

Joerg Schad commented on MESOS-2419:
------------------------------------

The issue seems to be related to the way systemd (which is used to start the 
slave in this scenario on CoreOS) handles the killing of child processes: on 
slave restart it kills all executors (which are children of the slave). As a 
result the executors cannot reconnect to the new slave, as they are no longer 
present.

This is caused by systemd's default KillMode being _control-group_ (see 
http://www.freedesktop.org/software/systemd/man/systemd.kill.html). I changed 
this to _process_ on my own instance (and also on slave0 on the test cluster). 
This seems to solve the problem.
I am currently testing several further options (docker/non-docker).
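
For reference, a minimal sketch of the change, assuming the slave runs under a 
unit named {{mesos-slave.service}} (the exact unit name and paths on CoreOS may 
differ):

{noformat}
# /etc/systemd/system/mesos-slave.service.d/10-killmode.conf
# Drop-in override: on stop/restart kill only the main mesos-slave process,
# leaving the executors (child processes of the slave) running so they can
# reconnect to the restarted slave.
[Service]
KillMode=process
{noformat}

After adding the drop-in, {{systemctl daemon-reload}} followed by a restart of 
the slave unit applies the new KillMode.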

When starting the slave manually (i.e., not via systemd), this problem does not 
occur and recovery works as expected.
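
For comparison, this is roughly what starting the slave manually looks like; 
the binary path, master address, and work directory below are placeholders for 
this setup:

{noformat}
# Run the slave directly from a shell (no systemd unit, so nothing kills the
# executors when the slave process itself is restarted).
/usr/sbin/mesos-slave \
  --master=zk://<zk-host>:2181/mesos \
  --work_dir=/var/lib/mesos \
  --recover=reconnect
{noformat}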





> Slave recovery not recovering tasks
> -----------------------------------
>
>                 Key: MESOS-2419
>                 URL: https://issues.apache.org/jira/browse/MESOS-2419
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Brenden Matthews
>            Assignee: Joerg Schad
>         Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> In a recent build from master (updated yesterday), slave recovery appears to 
> have broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called 
> `long-running-job` which is a Chronos job that just does `sleep 1h`. After 
> restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.247207 10022 docker.cpp:468] Recovering container 
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
> for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
> f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> '20150226-230228-2931198986-5050-717-0000' failed: Container 
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
> executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for 
> container f2001064-e076-4978-b764-ed12a5244e78 of executor 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal 
> task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78' 
> not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:50.599148 10024 composing.cpp:513] Container 
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update 
> TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for 
> status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for 
> task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED 
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 sent by status update manager 
> because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
