[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341066#comment-14341066 ]

Cody Maloney commented on MESOS-2419:
-------------------------------------

Reading the logs, one interesting thing is that we definitely checkpoint the task 
before the slave is shut down. But when it comes back up, state is recovered properly 
from the work directory, yet only one framework is recovered 
(20150227-014043-2931198986-5050-7728-0002), not all the frameworks on the host 
(see the "Recovering framework" messages). A sketch for double-checking what was 
actually checkpointed on disk is below, after the log excerpt.

The slave did find the framework / executor, though, because it sends the status 
update for it before the master even asks about the task:

{noformat}
Feb 27 21:46:21 ip-10-180-39-169.ec2.internal mesos-slave[15794]: W0227 21:46:21.273015 15797 status_update_manager.cpp:185] Resending status update TASK_FAILED (UUID: 3d42f411-eb80-406a-b4ca-c3a5428cbe17) for task chronos.0674b317-beca-11e4-b59b-566d21d75321 of framework 20150226-230228-2931198986-5050-717-0000
{noformat}
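
As a quick sanity check, here is a minimal sketch for listing what the slave has actually checkpointed on disk. It assumes the usual checkpoint layout under the slave work directory (meta/slaves/latest/frameworks/<framework_id>/executors/<executor_id>/runs/latest/tasks/<task_id>); treat the paths and the default work directory below as assumptions and point it at whatever the slave's --work_dir is set to:

{noformat}
#!/usr/bin/env python
# Rough sketch: walk the slave's checkpoint directory and print which
# frameworks / executors / tasks have checkpointed state on disk.
# The meta/slaves/latest/frameworks/.../executors/.../runs/latest/tasks/...
# layout is an assumption based on the standard checkpointing scheme;
# adjust work_dir to match the slave's --work_dir flag.
import os
import sys

work_dir = sys.argv[1] if len(sys.argv) > 1 else "/tmp/mesos"
frameworks_dir = os.path.join(work_dir, "meta", "slaves", "latest", "frameworks")

if not os.path.isdir(frameworks_dir):
    sys.exit("no checkpointed frameworks found under %s" % frameworks_dir)

for framework_id in sorted(os.listdir(frameworks_dir)):
    print("framework: %s" % framework_id)
    executors_dir = os.path.join(frameworks_dir, framework_id, "executors")
    if not os.path.isdir(executors_dir):
        continue
    for executor_id in sorted(os.listdir(executors_dir)):
        print("  executor: %s" % executor_id)
        tasks_dir = os.path.join(executors_dir, executor_id,
                                 "runs", "latest", "tasks")
        if os.path.isdir(tasks_dir):
            for task_id in sorted(os.listdir(tasks_dir)):
                print("    task: %s" % task_id)
{noformat}

If every framework that was running before the restart shows up here but only one of them gets a "Recovering framework" line on startup, the checkpoint data is present and the gap is on the recovery side.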

> Slave recovery not recovering tasks
> -----------------------------------
>
>                 Key: MESOS-2419
>                 URL: https://issues.apache.org/jira/browse/MESOS-2419
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Brenden Matthews
>         Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> In a recent build from master (updated yesterday), slave recovery appears to 
> be broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called 
> `long-running-job`, which is a Chronos job that just does `sleep 1h`. After 
> restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a Docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.247207 10022 docker.cpp:468] Recovering container 
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
> for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
> f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> '20150226-230228-2931198986-5050-717-0000' failed: Container 
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
> executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for 
> container f2001064-e076-4978-b764-ed12a5244e78 of executor 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal 
> task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78' 
> not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:50.599148 10024 composing.cpp:513] Container 
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update 
> TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for 
> status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for 
> task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED 
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 sent by status update manager 
> because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}



