[jira] [Commented] (MESOS-8416) CHECK failure if trying to recover nested containers but the framework checkpointing is not enabled.

Gilbert Song (JIRA) Wed, 04 Apr 2018 16:05:10 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426307#comment-16426307
 ]


Gilbert Song commented on MESOS-8416:
-------------------------------------

This could be a blocking issue for agent recovery. Users are likely to hit this 
issue if they use nested containers, in two cases:
# The agent metadata was deleted by operators for some reason.
# A custom framework did not set checkpoint to true but launched a task group 
that completed.

> CHECK failure if trying to recover nested containers but the framework 
> checkpointing is not enabled.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8416
>                 URL: https://issues.apache.org/jira/browse/MESOS-8416
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Gilbert Song
>            Priority: Minor
>              Labels: containerizer
>             Fix For: 1.5.1
>
>
> {noformat}
> I0108 23:05:25.313344 31743 slave.cpp:620] Agent attributes: [  ]
> I0108 23:05:25.313832 31743 slave.cpp:629] Agent hostname: 
> vagrant-ubuntu-wily-64
> I0108 23:05:25.314916 31763 task_status_update_manager.cpp:181] Pausing 
> sending task status updates
> I0108 23:05:25.323496 31766 state.cpp:66] Recovering state from 
> '/var/lib/mesos/slave/meta'
> I0108 23:05:25.323639 31766 state.cpp:724] No committed checkpointed 
> resources found at '/var/lib/mesos/slave/meta/resources/resources.info'
> I0108 23:05:25.326169 31760 task_status_update_manager.cpp:207] Recovering 
> task status update manager
> I0108 23:05:25.326954 31759 containerizer.cpp:674] Recovering containerizer
> F0108 23:05:25.331529 31759 containerizer.cpp:919] 
> CHECK_SOME(container->directory): is NONE 
> *** Check failure stack trace: ***
>     @     0x7f769dbc98bd  google::LogMessage::Fail()
>     @     0x7f769dbc8c8e  google::LogMessage::SendToLog()
>     @     0x7f769dbc958d  google::LogMessage::Flush()
>     @     0x7f769dbcca08  google::LogMessageFatal::~LogMessageFatal()
>     @     0x556cb4c2b937  _CheckFatal::~_CheckFatal()
>     @     0x7f769c5ac653  
> mesos::internal::slave::MesosContainerizerProcess::recover()
> {noformat}
> If the framework does not enable checkpointing, no slave state is 
> checkpointed. But containers are still checkpointed in the runtime dir, 
> which means that recovering a nested container causes the CHECK failure 
> because its parent's sandbox dir is unknown.
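
To illustrate the failure mode described above, here is a minimal sketch of a 
recovery pass that reports nested containers whose parent sandbox is unknown 
instead of aborting on them. This is not the Mesos code and not the actual 
fix: the Container struct and findUnrecoverable() are hypothetical names made 
up for this example, and std::optional stands in for the Option type that 
CHECK_SOME inspects.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a container record rebuilt from the
// runtime dir during recovery. Not the real Mesos data structure.
struct Container {
  std::optional<std::string> parent;     // parent container ID; empty if top-level
  std::optional<std::string> directory;  // sandbox dir; empty if never checkpointed
};

// Returns the IDs of nested containers whose sandbox cannot be derived because
// the parent is missing or the parent's sandbox dir is unknown. A hard check
// on the parent's directory would abort the process for exactly these IDs.
std::vector<std::string> findUnrecoverable(
    const std::map<std::string, Container>& containers) {
  std::vector<std::string> unrecoverable;
  for (const auto& [id, container] : containers) {
    if (!container.parent.has_value()) {
      continue;  // top-level container: nothing to derive from a parent
    }
    auto parent = containers.find(*container.parent);
    if (parent == containers.end() ||
        !parent->second.directory.has_value()) {
      unrecoverable.push_back(id);
    }
  }
  return unrecoverable;
}
```

A caller could log or clean up the returned IDs and continue recovering the 
remaining containers. Whether that is the right behavior for the agent is a 
design decision this sketch does not settle.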



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
