[
https://issues.apache.org/jira/browse/MESOS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512918#comment-16512918
]
Gilbert Song commented on MESOS-8416:
-------------------------------------
A workaround for this issue for containers with persistent volumes:
1. Find the ids of the frameworks that have checkpointing set to false from
the /frameworks endpoint.
2. Using these framework ids, traverse all the corresponding container
sandboxes, which should be at
`/var/lib/mesos/slave/slaves/<slave-id>/frameworks/<the-framework-ids-from-step-1>/executors/<executor-id>/runs/<container-id>`,
and collect all of these container ids.
3. Based on the container ids from step 2, kill these container processes.
4. Watch the mount points on the host (cat /proc/self/mountinfo) and find the
mount points that are under the sandboxes of the containers from step 2 (for
monitoring).
5. Run
`umount -R /var/lib/mesos/slave/slaves/<slave-id>/frameworks/<each-framework-id-from-Step-1>`.
The umount is necessary because of
https://issues.apache.org/jira/browse/MESOS-8830
6. Watch the mount points again as in step 4, and verify that the container ids
from step 2 do not show up in the mount table.
7. Remove each container runtime dir at
/var/run/mesos/containers/<container-ids-from-Step-2>. The agent should then be
able to recover, and the old PV should be included in a new offer sent to the
framework.
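
The read-only parts of this workaround (steps 1, 2, 4 and 6) could be scripted roughly as in the sketch below. It is only an illustration under some assumptions: that the /frameworks response is JSON with a top-level "frameworks" list whose entries carry "id" and "checkpoint" fields, that a missing "checkpoint" field means false, and that the sandbox layout matches the path given in step 2. The function names are hypothetical and not part of Mesos. The destructive steps (3, 5 and 7) are deliberately left to be done by hand.

```python
import json
import os
from typing import Dict, Iterable, List


def non_checkpointing_framework_ids(frameworks_json: str) -> List[str]:
    """Step 1: ids of frameworks whose 'checkpoint' field is false.

    Assumption: the payload looks like {"frameworks": [{"id": ..., "checkpoint": ...}]}
    and a missing 'checkpoint' field is treated as false.
    """
    data = json.loads(frameworks_json)
    return [
        fw["id"]
        for fw in data.get("frameworks", [])
        if not fw.get("checkpoint", False)
    ]


def container_sandboxes(slaves_dir: str, framework_ids: Iterable[str]) -> Dict[str, str]:
    """Step 2: map container id -> sandbox path for the given frameworks.

    Walks <slaves_dir>/<slave-id>/frameworks/<framework-id>/executors/
    <executor-id>/runs/<container-id>, where slaves_dir would be something
    like /var/lib/mesos/slave/slaves.
    """
    wanted = list(framework_ids)
    found: Dict[str, str] = {}
    if not os.path.isdir(slaves_dir):
        return found
    for slave_id in os.listdir(slaves_dir):
        for framework_id in wanted:
            executors = os.path.join(
                slaves_dir, slave_id, "frameworks", framework_id, "executors")
            if not os.path.isdir(executors):
                continue
            for executor_id in os.listdir(executors):
                runs = os.path.join(executors, executor_id, "runs")
                if not os.path.isdir(runs):
                    continue
                for entry in os.listdir(runs):
                    path = os.path.join(runs, entry)
                    # Skip symlinks (for example a 'latest' link) so that
                    # only real container directories are collected.
                    if os.path.isdir(path) and not os.path.islink(path):
                        found[entry] = path
    return found


def mount_points_under(mountinfo_text: str, prefixes: Iterable[str]) -> List[str]:
    """Steps 4 and 6: mount points located at or below any of the prefixes.

    The mount point is the fifth whitespace-separated field of each line
    of /proc/self/mountinfo.
    """
    roots = [p.rstrip("/") for p in prefixes]
    hits = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        mount_point = fields[4]
        if any(mount_point == r or mount_point.startswith(r + "/") for r in roots):
            hits.append(mount_point)
    return hits
```

One way to use it: feed the body of the /frameworks response to the first function, pass the resulting ids to `container_sandboxes`, then call `mount_points_under` with the contents of /proc/self/mountinfo before and after the umount in step 5. After step 5 the returned list should be empty.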
> CHECK failure if trying to recover nested containers but the framework
> checkpointing is not enabled.
> ----------------------------------------------------------------------------------------------------
>
> Key: MESOS-8416
> URL: https://issues.apache.org/jira/browse/MESOS-8416
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Gilbert Song
> Assignee: Gilbert Song
> Priority: Blocker
> Labels: containerizer, mesosphere
> Fix For: 1.5.1, 1.6.0
>
>
> {noformat}
> I0108 23:05:25.313344 31743 slave.cpp:620] Agent attributes: [ ]
> I0108 23:05:25.313832 31743 slave.cpp:629] Agent hostname: vagrant-ubuntu-wily-64
> I0108 23:05:25.314916 31763 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0108 23:05:25.323496 31766 state.cpp:66] Recovering state from '/var/lib/mesos/slave/meta'
> I0108 23:05:25.323639 31766 state.cpp:724] No committed checkpointed resources found at '/var/lib/mesos/slave/meta/resources/resources.info'
> I0108 23:05:25.326169 31760 task_status_update_manager.cpp:207] Recovering task status update manager
> I0108 23:05:25.326954 31759 containerizer.cpp:674] Recovering containerizer
> F0108 23:05:25.331529 31759 containerizer.cpp:919] CHECK_SOME(container->directory): is NONE
> *** Check failure stack trace: ***
> @ 0x7f769dbc98bd google::LogMessage::Fail()
> @ 0x7f769dbc8c8e google::LogMessage::SendToLog()
> @ 0x7f769dbc958d google::LogMessage::Flush()
> @ 0x7f769dbcca08 google::LogMessageFatal::~LogMessageFatal()
> @ 0x556cb4c2b937 _CheckFatal::~_CheckFatal()
> @ 0x7f769c5ac653 mesos::internal::slave::MesosContainerizerProcess::recover()
> {noformat}
> If the framework does not enable checkpointing, no slave state is
> checkpointed. However, containers are still checkpointed at the runtime
> dir, which means that recovering a nested container causes the CHECK
> failure because its parent's sandbox dir is unknown.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)