[ https://issues.apache.org/jira/browse/MESOS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426307#comment-16426307 ]
Gilbert Song commented on MESOS-8416:
-------------------------------------
This could be a blocking issue for agent recovery. Users are likely to hit this
issue if they use nested containers, in two cases:
# The agent metadata was deleted by operators for some reason.
# A custom framework did not set checkpoint to true but launched a task group
that has since completed.
> CHECK failure if trying to recover nested containers but the framework
> checkpointing is not enabled.
> ----------------------------------------------------------------------------------------------------
>
> Key: MESOS-8416
> URL: https://issues.apache.org/jira/browse/MESOS-8416
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Gilbert Song
> Priority: Minor
> Labels: containerizer
> Fix For: 1.5.1
>
>
> {noformat}
> I0108 23:05:25.313344 31743 slave.cpp:620] Agent attributes: [ ]
> I0108 23:05:25.313832 31743 slave.cpp:629] Agent hostname: vagrant-ubuntu-wily-64
> I0108 23:05:25.314916 31763 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0108 23:05:25.323496 31766 state.cpp:66] Recovering state from '/var/lib/mesos/slave/meta'
> I0108 23:05:25.323639 31766 state.cpp:724] No committed checkpointed resources found at '/var/lib/mesos/slave/meta/resources/resources.info'
> I0108 23:05:25.326169 31760 task_status_update_manager.cpp:207] Recovering task status update manager
> I0108 23:05:25.326954 31759 containerizer.cpp:674] Recovering containerizer
> F0108 23:05:25.331529 31759 containerizer.cpp:919] CHECK_SOME(container->directory): is NONE
> *** Check failure stack trace: ***
> @ 0x7f769dbc98bd google::LogMessage::Fail()
> @ 0x7f769dbc8c8e google::LogMessage::SendToLog()
> @ 0x7f769dbc958d google::LogMessage::Flush()
> @ 0x7f769dbcca08 google::LogMessageFatal::~LogMessageFatal()
> @ 0x556cb4c2b937 _CheckFatal::~_CheckFatal()
> @ 0x7f769c5ac653 mesos::internal::slave::MesosContainerizerProcess::recover()
> {noformat}
> If the framework does not enable checkpointing, no slave state is
> checkpointed. But containers are still checkpointed in the runtime dir,
> which means recovering a nested container causes the CHECK failure because
> its parent's sandbox dir is unknown.
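
To make the failure mode in the description concrete, here is a rough Python model of it. This is not Mesos code: the Container class, the recover_sandbox helper, and the "<parent sandbox>/containers/<id>" layout are assumptions made up for illustration. It only shows that a nested container's sandbox has to be derived from its parent's, so a parent whose sandbox dir was never checkpointed leaves the child with nothing to recover from.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Container:
    # Hypothetical, simplified stand-in for the per-container recovery state.
    parent: Optional[str] = None     # set only for nested containers
    directory: Optional[str] = None  # sandbox dir; None if never checkpointed


def recover_sandbox(containers: Dict[str, Container],
                    container_id: str) -> Optional[str]:
    """Return the sandbox dir for container_id, or None if it is unknown."""
    container = containers[container_id]
    if container.parent is None:
        # Top-level container: the dir is only known from checkpointed state.
        return container.directory
    parent_dir = recover_sandbox(containers, container.parent)
    if parent_dir is None:
        # This models the condition under which the CHECK in the log fires.
        # The sketch reports "unknown" instead of aborting.
        return None
    # Assumed layout for nested sandboxes, for illustration only.
    return f"{parent_dir}/containers/{container_id}"


# Parent recovered from the runtime dir only, so its sandbox dir is unknown.
state = {"parent": Container(), "child": Container(parent="parent")}
print(recover_sandbox(state, "child"))
```

With checkpointing enabled, the parent entry would carry a directory and the child's sandbox could be derived from it. Without it, the lookup comes back empty, which corresponds to the "is NONE" case in the log above.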
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)