Greg Mann created MESOS-6302:
--------------------------------
Summary: Agent recovery can fail after nested containers are launched
Key: MESOS-6302
URL: https://issues.apache.org/jira/browse/MESOS-6302
Project: Mesos
Issue Type: Bug
Reporter: Greg Mann
Assignee: Gilbert Song
Priority: Blocker
Fix For: 1.1.0
After launching a nested container that used a Docker image, I restarted the
agent that ran that task group and saw the following in the agent logs during
recovery:
{code}
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813596 4640 status_update_manager.cpp:203] Recovering status update manager
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813622 4640 status_update_manager.cpp:211] Recovering executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of framework 118ca38d-daee-4b2d-b584-b5581738a3dd-0000
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.814249 4639 docker.cpp:745] Recovering Docker containers
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.815294 4642 containerizer.cpp:581] Recovering containerizer
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the container directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends': No such file or directory
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: To remedy this do as follows:
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: This ensures agent doesn't recover old live executors.
Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 2: Restart the agent.
{code}
After that, the agent keeps restarting and failing recovery in the same way.
Attached is the Marathon app definition that I used to launch the task group.
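
To check whether other containers on the same agent would hit the same
opendir failure, something like the following Python sketch may help. It is
not part of Mesos; the function name is made up for illustration, and the
directory layout (provisioner/containers/<container id>/backends) is assumed
only from the path shown in the log above. It inspects top-level container
directories and does not descend into any nested layout.

```python
import os


def containers_missing_backends(provisioner_root):
    """Return IDs of container directories that lack a 'backends' subdirectory.

    Hypothetical helper: the layout <root>/containers/<id>/backends is an
    assumption taken from the error message in the agent log.
    """
    containers_dir = os.path.join(provisioner_root, "containers")
    if not os.path.isdir(containers_dir):
        return []

    missing = []
    for container_id in sorted(os.listdir(containers_dir)):
        container_dir = os.path.join(containers_dir, container_id)
        if not os.path.isdir(container_dir):
            continue  # skip stray files
        if not os.path.isdir(os.path.join(container_dir, "backends")):
            missing.append(container_id)
    return missing


if __name__ == "__main__":
    # Path taken from the log above; adjust to the agent's actual work_dir.
    for cid in containers_missing_backends("/var/lib/mesos/slave/provisioner"):
        print(cid)
```

On the agent from this report, such a check would be expected to list
a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53, since that is the directory the
recovery code failed to open.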