[
https://issues.apache.org/jira/browse/MESOS-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879140#comment-15879140
]
Jie Yu commented on MESOS-7152:
-------------------------------
commit 020b37ee9c44007ecd0016fbbf6012054953dd5b
Author: Gilbert Song <[email protected]>
Date: Wed Feb 22 09:10:58 2017 -0800
Fixed nested container agent flapping issue after reboot.
When recovering containers in the provisioner, there is a particular case
in which, after the machine reboots, the container runtime directory and
slave state are gone but the provisioner directory still exists, since
it is under the agent work_dir (e.g., agent work_dir is /var/lib/mesos).
Then, all checkpointed containers will be cleaned up as unknown
containers in the provisioner during recovery. However, the semantic that
a child container is always cleaned up before its parent container
cannot be guaranteed for this particular case. Ideally, we should
put the provisioner directory under the container runtime dir, but this
is not backward compatible. It is unfortunate that we have to
make provisioner::destroy() recursive.
Review: https://reviews.apache.org/r/56808/
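The idea of a recursive destroy can be illustrated with a minimal sketch. This is not the code from the review above: the `Infos` map, the use of an empty string for "no parent", and the function names `destroyRecursive` and `destroyAll` are all hypothetical stand-ins for the provisioner's real bookkeeping.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified model: each known container ID maps to its
// parent ID; an empty string means the container is top-level.
using ContainerId = std::string;
using Infos = std::map<ContainerId, ContainerId>;

// Destroys `id` only after every container nested under it has been
// destroyed. Destroyed IDs are appended to `order` for inspection.
void destroyRecursive(
    Infos& infos,
    const ContainerId& id,
    std::vector<ContainerId>& order)
{
  if (infos.count(id) == 0) {
    return;  // Already destroyed by an earlier recursive call.
  }

  // Collect direct children first, so the map is not modified while
  // it is being iterated.
  std::vector<ContainerId> children;
  for (const auto& entry : infos) {
    if (!entry.second.empty() && entry.second == id) {
      children.push_back(entry.first);
    }
  }

  for (const ContainerId& child : children) {
    destroyRecursive(infos, child, order);
  }

  infos.erase(id);
  order.push_back(id);
}

// Models recovery: every unknown container is destroyed, in whatever
// order the caller happens to supply.
std::vector<ContainerId> destroyAll(
    Infos infos,
    const std::vector<ContainerId>& unknown)
{
  std::vector<ContainerId> order;
  for (const ContainerId& id : unknown) {
    destroyRecursive(infos, id, order);
  }
  assert(infos.empty() || unknown.size() < order.size() + infos.size());
  return order;
}
```

With this shape, the caller no longer has to present children before parents: asking for container 1 first still tears down 1.2 before 1.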
> The agent may be flapping after the machine reboots due to provisioner
> recover.
> -------------------------------------------------------------------------------
>
> Key: MESOS-7152
> URL: https://issues.apache.org/jira/browse/MESOS-7152
> Project: Mesos
> Issue Type: Bug
> Reporter: Gilbert Song
> Assignee: Gilbert Song
> Priority: Blocker
> Labels: nested, provisioner
>
> After the agent machine reboots, if the agent work dir survives (e.g.,
> /var/lib/mesos) and the container runtime directory is gone (leaving an
> empty SlaveState as well), the provisioner recover() would crash on a check
> failure, because that case breaks the semantic that a child container should
> always be cleaned up before its parent container.
> This is a particular case which only happens if the machine reboots and the
> provisioner directory survives.
> {noformat}
> F0217 01:10:18.423238 30099 provisioner.cpp:504] Check failed: entry.parent() != containerId Failed to destroy container 1 since its nested container 1.2 has not been destroyed yet
> *** Check failure stack trace: ***
> @ 0x7fceb444121d google::LogMessage::Fail()
> @ 0x7fceb44405ee google::LogMessage::SendToLog()
> @ 0x7fceb4440eed google::LogMessage::Flush()
> @ 0x7fceb4444368 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fceb36137f9 mesos::internal::slave::ProvisionerProcess::destroy()
> @ 0x7fceb36126f0 mesos::internal::slave::ProvisionerProcess::recover()
> @ 0x7fceb3637fc6 _ZZN7process8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS2_11ContainerIDESt4hashIS7_ESt8equal_toIS7_EESC_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESS_
> @ 0x7fceb3637bc2 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS6_11ContainerIDESt4hashISB_ESt8equal_toISB_EESG_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSN_FSL_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7fceb43848e4 std::function<>::operator()()
> @ 0x7fceb436baf4 process::ProcessBase::visit()
> @ 0x7fceb43e5fde process::DispatchEvent::visit()
> @ 0x9e4101 process::ProcessBase::serve()
> @ 0x7fceb4369007 process::ProcessManager::resume()
> @ 0x7fceb4377a8c process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x7fceb4377995 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7fceb4377965 std::_Bind_simple<>::operator()()
> @ 0x7fceb437793c std::thread::_Impl<>::_M_run()
> @ 0x7fceadefa030 (unknown)
> @ 0x7fcead70b6aa start_thread
> @ 0x7fcead440e9d (unknown)
> {noformat}
> The provisioner directory is supposed to be under the container runtime
> directory. However, this is not backward compatible. We can only change it
> after a deprecation cycle.
> For now, we have three options:
> 1. make provisioner::destroy() recursive.
> 2. sort the containers during recovery to guarantee the `child before parent`
> semantic.
> 3. remove the check-failure, since the whole provisioner dir will be removed
> eventually at the end (not recommended).
> Recommend (1).
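For comparison, option (2) could be sketched as a depth-based sort, so that more deeply nested containers are always handed to destroy first. This is only an illustration of the alternative that was not chosen: the `Parents` map, the empty-string convention for top-level containers, and the names `depth` and `childrenFirst` are assumptions, not Mesos APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: container ID -> parent ID ("" for top-level).
using ContainerId = std::string;
using Parents = std::map<ContainerId, ContainerId>;

// Number of ancestors of `id`, found by walking the parent chain.
std::size_t depth(const Parents& parents, const ContainerId& id)
{
  std::size_t d = 0;
  auto it = parents.find(id);
  while (it != parents.end() && !it->second.empty()) {
    ++d;
    it = parents.find(it->second);
  }
  return d;
}

// Returns all container IDs ordered deepest first, which guarantees
// that every child appears before its parent.
std::vector<ContainerId> childrenFirst(const Parents& parents)
{
  std::vector<ContainerId> ids;
  for (const auto& entry : parents) {
    ids.push_back(entry.first);
  }

  std::stable_sort(
      ids.begin(),
      ids.end(),
      [&parents](const ContainerId& a, const ContainerId& b) {
        return depth(parents, a) > depth(parents, b);
      });

  return ids;
}
```

Sorting keeps destroy() itself non-recursive, but it puts the ordering burden on every caller, which is one plausible reason to prefer option (1).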