[jira] [Commented] (MESOS-8416) CHECK failure if trying to recover nested containers but the framework checkpointing is not enabled.

Gilbert Song (JIRA) Thu, 14 Jun 2018 12:59:08 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512918#comment-16512918
 ]


Gilbert Song commented on MESOS-8416:
-------------------------------------

A workaround for this issue for containers with persistent volumes:
1. Find the ids of the frameworks that have checkpointing set to false from 
the /frameworks endpoint.
2. For each of these framework ids, traverse the corresponding container 
sandboxes, which should be at 
`/var/lib/mesos/slave/slaves/<slave-id>/frameworks/<the-framework-ids-from-step-1>/executors/<executor-id>/runs/<container-id>`,
and collect all the container ids.
3. Kill the container processes for the container ids from step 2.
4. Check the mount points on the host (`cat /proc/self/mountinfo`) and find 
the mount points that are under the sandboxes of the containers from step 2 
(for monitoring).
5. Run 
`umount -R /var/lib/mesos/slave/slaves/<slave-id>/frameworks/<each-framework-id-from-step-1>`.
The umount is needed because of 
https://issues.apache.org/jira/browse/MESOS-8830
6. Check the mount points again as in step 4 and verify that the container ids 
from step 2 no longer show up in the mount table.
7. Remove each container runtime dir at 
/var/run/mesos/containers/<container-ids-from-step-2>. The agent should then 
be able to recover, and the old PV should be included in a new offer sent to 
the framework.
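
In case it helps, here is a rough Python sketch of the read-only steps only 
(1, 2, 4 and 6). It does not kill, unmount or remove anything. The helper names 
are hypothetical, and it assumes the /frameworks response is JSON with a 
"frameworks" list whose entries carry "id" and "checkpoint" fields, and that 
the agent work dir is /var/lib/mesos/slave as in the paths above. Please 
verify both against your cluster before relying on it.

```python
import os


def non_checkpointing_framework_ids(frameworks_json):
    """Step 1: ids of frameworks whose 'checkpoint' field is false.

    Assumption: a missing 'checkpoint' field is treated as false.
    """
    return [
        fw["id"]
        for fw in frameworks_json.get("frameworks", [])
        if not fw.get("checkpoint", False)
    ]


def container_ids(work_dir, slave_id, framework_ids):
    """Step 2: container ids found under .../executors/<executor-id>/runs/."""
    found = set()
    for fw_id in framework_ids:
        executors = os.path.join(
            work_dir, "slaves", slave_id, "frameworks", fw_id, "executors")
        if not os.path.isdir(executors):
            continue
        for executor_id in os.listdir(executors):
            runs = os.path.join(executors, executor_id, "runs")
            if not os.path.isdir(runs):
                continue
            for entry in os.listdir(runs):
                path = os.path.join(runs, entry)
                # Skip symlinks (such as a 'latest' link) so that each
                # container id is taken from a real run directory.
                if os.path.isdir(path) and not os.path.islink(path):
                    found.add(entry)
    return sorted(found)


def mount_points_under(mountinfo_text, prefix):
    """Steps 4 and 6: mount points at or below 'prefix'.

    The mount point is the fifth whitespace-separated field of each
    /proc/self/mountinfo line.
    """
    prefix = prefix.rstrip("/")
    result = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        mount_point = fields[4]
        if mount_point == prefix or mount_point.startswith(prefix + "/"):
            result.append(mount_point)
    return result
```

For step 6, passing the contents of /proc/self/mountinfo and the framework 
directory from step 5 to mount_points_under() should return an empty list 
once the umount has succeeded.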

> CHECK failure if trying to recover nested containers but the framework 
> checkpointing is not enabled.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8416
>                 URL: https://issues.apache.org/jira/browse/MESOS-8416
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Assignee: Gilbert Song
>            Priority: Blocker
>              Labels: containerizer, mesosphere
>             Fix For: 1.5.1, 1.6.0
>
>
> {noformat}
> I0108 23:05:25.313344 31743 slave.cpp:620] Agent attributes: [  ]
> I0108 23:05:25.313832 31743 slave.cpp:629] Agent hostname: vagrant-ubuntu-wily-64
> I0108 23:05:25.314916 31763 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0108 23:05:25.323496 31766 state.cpp:66] Recovering state from '/var/lib/mesos/slave/meta'
> I0108 23:05:25.323639 31766 state.cpp:724] No committed checkpointed resources found at '/var/lib/mesos/slave/meta/resources/resources.info'
> I0108 23:05:25.326169 31760 task_status_update_manager.cpp:207] Recovering task status update manager
> I0108 23:05:25.326954 31759 containerizer.cpp:674] Recovering containerizer
> F0108 23:05:25.331529 31759 containerizer.cpp:919] CHECK_SOME(container->directory): is NONE
> *** Check failure stack trace: ***
>     @     0x7f769dbc98bd  google::LogMessage::Fail()
>     @     0x7f769dbc8c8e  google::LogMessage::SendToLog()
>     @     0x7f769dbc958d  google::LogMessage::Flush()
>     @     0x7f769dbcca08  google::LogMessageFatal::~LogMessageFatal()
>     @     0x556cb4c2b937  _CheckFatal::~_CheckFatal()
>     @     0x7f769c5ac653  mesos::internal::slave::MesosContainerizerProcess::recover()
> {noformat}
> If the framework does not enable checkpointing, no slave state is 
> checkpointed. However, containers are still checkpointed in the runtime dir, 
> so recovering a nested container triggers the CHECK failure because its 
> parent's sandbox dir is unknown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
