[ https://issues.apache.org/jira/browse/MESOS-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone updated MESOS-6274:
------------------------------
    Summary: Agent should not allow HTTP executors to re-subscribe before containerizer recovery is done.  (was: Agent should not allow an executor to re-subscribe before containerizer recovery is done.)

> Agent should not allow HTTP executors to re-subscribe before containerizer recovery is done.
> --------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6274
>                 URL: https://issues.apache.org/jira/browse/MESOS-6274
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jie Yu
>            Assignee: Anand Mazumdar
>            Priority: Blocker
>             Fix For: 1.1.0, 1.0.2
>
>
> In the old API, the agent sends a reconnect request to the executor, and
> the executor then registers with the agent.
> In the new API, the agent allows an executor to re-subscribe before
> containerizer recovery is done. This is problematic because the
> containerizer does not know about the containers yet, so calling
> containerizer->update fails, which causes the container to be killed.
> {noformat}
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693418 22646 containerizer.cpp:580] Recovering containerizer
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693444 22646 containerizer.cpp:636] Recovering container 568968cc-f41c-475a-bb2b-45d8babd853d for executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693445 22645 http.cpp:273] HTTP POST for /agent/api/v1/executor from 172.30.2.198:42683
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693567 22645 slave.cpp:3017] Received Subscribe request for HTTP executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP)
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693613 22645 slave.cpp:3080] Creating a marker file for HTTP based executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP) at path '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_ROOT_CGROUPS_ReconnectDefaultExecutor_XpQvvJ/meta/slaves/7e4c8518-cb45-4b09-9fa8-c029d56289e2-S0/frameworks/7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000/executors/default/runs/568968cc-f41c-475a-bb2b-45d8babd853d/http.marker'
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693733 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: 6cc3f9a7-d020-46f0-82c1-39fbb9d43786) for task db1f9b1b-75d2-4d96-831f-48d6f28301e8 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] I0929 04:04:11.693801 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: f80d217b-7844-4134-8cc8-db6998ac437e) for task 3a583cbb-8ea9-440a-864d-e68a23472368 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:	 [Step 10/10] E0929 04:04:11.694232 22648 slave.cpp:2055] Failed to update resources for container 568968cc-f41c-475a-bb2b-45d8babd853d of executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000, destroying container: Collect failed: Unknown container
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
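
The ordering problem described in the issue can be illustrated with a small, self-contained sketch. This is not the Mesos agent code and not necessarily how the fix was implemented: `Agent`, `Containerizer`, `Response`, `finishRecovery`, and the status codes are all hypothetical names chosen for illustration. The sketch assumes one possible remedy, namely that the agent refuses subscribe requests while it is still recovering, so that `update` is never called on a container the containerizer has not yet recovered.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical model of the agent's lifecycle (not the real Mesos types).
enum class AgentState { RECOVERING, RUNNING };

// Hypothetical HTTP-style reply returned to the executor.
struct Response {
  int status;
  std::string body;
};

// Toy containerizer: it only knows containers it has already recovered.
class Containerizer {
public:
  void recover(const std::string& containerId) { known.insert(containerId); }

  // Mirrors the "Unknown container" failure seen in the log: updating a
  // container that has not been recovered yet fails.
  bool update(const std::string& containerId) const {
    return known.count(containerId) > 0;
  }

private:
  std::unordered_set<std::string> known;
};

class Agent {
public:
  // Recover the container, then leave the RECOVERING state.
  void finishRecovery(const std::string& containerId) {
    assert(state == AgentState::RECOVERING);
    containerizer.recover(containerId);
    state = AgentState::RUNNING;
  }

  // Assumed guard: reject re-subscription until recovery is done, so the
  // executor can retry later instead of having its container destroyed.
  Response subscribe(const std::string& containerId) {
    if (state == AgentState::RECOVERING) {
      return {503, "Agent has not finished recovery"};
    }
    if (!containerizer.update(containerId)) {
      return {500, "Unknown container"};
    }
    return {200, "Subscribed"};
  }

private:
  AgentState state = AgentState::RECOVERING;
  Containerizer containerizer;
};
```

In this sketch, a subscribe call that arrives before `finishRecovery` gets a retryable rejection rather than reaching `update`. Without the guard, the same early call would fall through to `update` and hit the "Unknown container" path, which corresponds to the failure shown in the log above.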