[ 
https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358288#comment-14358288
 ] 

Geoffroy Jabouley edited comment on MESOS-2419 at 3/12/15 8:28 AM:
-------------------------------------------------------------------

Hello

I have tried starting the slave manually, and it ends with the same issue.
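
For reference, the manual start was roughly of this form (a sketch only; the
master address, work directory and timeout are placeholder values, not the
exact ones used on this cluster):

{code}
# Manual start of the slave with the docker containerizer enabled.
# Master address, work_dir and timeout below are placeholders.
mesos-slave \
  --master=10.195.96.10:5050 \
  --containerizers=docker,mesos \
  --recover=reconnect \
  --work_dir=/tmp/mesos \
  --executor_registration_timeout=5mins
{code}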

What is strange is that the task is running perfectly inside its container, and
both Marathon and the Mesos master show an "OK" status for the task.

I can even suspend it in Marathon, and the docker container then gets terminated
on the mesos slave (with some errors, see the log below...)

{code}
I0312 09:14:19.381512   631 slave.cpp:1372] Asked to kill task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0312 09:14:19.453346   633 slave.cpp:2215] Handling status update TASK_KILLED 
(UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 from executor(1)@10.195.96.237:46433
E0312 09:14:19.453675   633 slave.cpp:2344] Failed to update resources for 
container 6b88c27d-e975-4866-a725-7b410e4cec15 of executor 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 running task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 on status update for 
terminal task, destroying container: Collect failed: Unknown container
I0312 09:14:19.453768   633 status_update_manager.cpp:317] Received status 
update TASK_KILLED (UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0312 09:14:19.453801   633 status_update_manager.hpp:346] Checkpointing UPDATE 
for status update TASK_KILLED (UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for 
task test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0312 09:14:19.467113   633 slave.cpp:2458] Forwarding the update TASK_KILLED 
(UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 to [email protected]:5050
I0312 09:14:19.467362   633 slave.cpp:2391] Sending acknowledgement for status 
update TASK_KILLED (UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000 to executor(1)@10.195.96.237:46433
I0312 09:14:19.475836   633 status_update_manager.cpp:389] Received status 
update acknowledgement (UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for task 
test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0312 09:14:19.475875   633 status_update_manager.hpp:346] Checkpointing ACK 
for status update TASK_KILLED (UUID: d6173551-35c1-4e65-8370-624ee3ce2aa8) for 
task test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799 of framework 
20150311-150951-3982541578-5050-50860-0000
I0312 09:14:20.515034   633 docker.cpp:1678] Executor for container 
'6b88c27d-e975-4866-a725-7b410e4cec15' has exited
I0312 09:14:20.515118   633 docker.cpp:1501] Destroying container 
'6b88c27d-e975-4866-a725-7b410e4cec15'
I0312 09:14:20.515192   633 docker.cpp:1593] Running docker stop on container 
'6b88c27d-e975-4866-a725-7b410e4cec15'
I0312 09:14:20.516291   633 containerizer.cpp:1117] Executor for container 
'6b88c27d-e975-4866-a725-7b410e4cec15' has exited
I0312 09:14:20.516821   633 slave.cpp:2891] Executor 
'test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000 has terminated with unknown status
I0312 09:14:20.516870   633 slave.cpp:3007] Cleaning up executor 
'test-app-bveaf.498c0945-c88f-11e4-946b-56847afe9799' of framework 
20150311-150951-3982541578-5050-50860-0000
{code}
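
(For completeness, the suspend on the Marathon side is simply a scale to zero
instances through the REST API; a sketch, with a placeholder Marathon host and
the app id taken from the logs above:)

{code}
# Suspend the app by scaling it to 0 instances via Marathon's REST API.
# Host/port are placeholders; the app id matches the task shown in the logs.
curl -X PUT http://marathon-host:8080/v2/apps/test-app-bveaf \
  -H 'Content-Type: application/json' \
  -d '{"instances": 0}'
{code}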

Maybe this is a different issue from the one reported here.


was (Author: geoffroy.jabouley):
Hello

I have tried starting the slave manually, and it ends with the same issue.

What is strange is that the task is running perfectly inside its container, and
both Marathon and the Mesos master show an "OK" status for the task.
I can even suspend it in Marathon, and the docker container then gets correctly
terminated on the mesos cluster.

> Slave recovery not recovering tasks
> -----------------------------------
>
>                 Key: MESOS-2419
>                 URL: https://issues.apache.org/jira/browse/MESOS-2419
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Brenden Matthews
>            Assignee: Joerg Schad
>         Attachments: mesos-chronos.log.gz, mesos.log.gz
>
>
> In a recent build from master (updated yesterday), slave recovery appears to 
> have broken.
> I'll attach the slave log (with GLOG_v=1) showing a task called 
> `long-running-job` which is a Chronos job that just does `sleep 1h`. After 
> restarting the slave, the task shows as `TASK_FAILED`.
> Here's another case, which is for a docker task:
> {noformat}
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.247207 10022 docker.cpp:468] Recovering container 
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254791 10022 docker.cpp:1333] Executor for container 
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254812 10022 docker.cpp:1159] Destroying container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 
> 'f2001064-e076-4978-b764-ed12a5244e78' for executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup 
> for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container 
> f2001064-e076-4978-b764-ed12a5244e78
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 
> 'f2001064-e076-4978-b764-ed12a5244e78' has exited
> Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 
> 'f2001064-e076-4978-b764-ed12a5244e78'
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 at executor(1)@10.81.189.232:43130
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
> 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 
> '20150226-230228-2931198986-5050-717-0000' failed: Container 
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for 
> executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000: Not monitored
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED 
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 from @0.0.0.0:0
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 
> 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for 
> container f2001064-e076-4978-b764-ed12a5244e78 of executor 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal 
> task, destroying container: Container 'f2001064-e076-4978-b764-ed12a5244e78' 
> not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:50.599148 10024 composing.cpp:513] Container 
> 'f2001064-e076-4978-b764-ed12a5244e78' not found
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.599220 10024 status_update_manager.cpp:317] Received status update 
> TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:50.599256 10024 status_update_manager.hpp:346] Checkpointing UPDATE for 
> status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for 
> task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000
> Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 
> 00:09:50.607086 10022 slave.cpp:2706] Dropping status update TASK_FAILED 
> (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task 
> chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 
> 20150226-230228-2931198986-5050-717-0000 sent by status update manager 
> because the slave is in RECOVERING state
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:52.594267 10021 slave.cpp:2457] Cleaning up un-reregistered executors
> Feb 27 00:09:52 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 
> 00:09:52.594379 10021 slave.cpp:3794] Finished recovery
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
