[ 
https://issues.apache.org/jira/browse/MESOS-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879140#comment-15879140
 ] 

Jie Yu commented on MESOS-7152:
-------------------------------

commit 020b37ee9c44007ecd0016fbbf6012054953dd5b
Author: Gilbert Song <[email protected]>
Date:   Wed Feb 22 09:10:58 2017 -0800

    Fixed nested container agent flapping issue after reboot.

    When recovering containers in the provisioner, there is a particular
    case in which, after the machine reboots, the container runtime
    directory and the slave state are gone but the provisioner directory
    still exists, since it is under the agent work_dir (e.g., agent
    work_dir is /var/lib/mesos). In that case, all checkpointed containers
    are cleaned up as unknown containers in the provisioner during
    recovery. However, the semantic that a child container is always
    cleaned up before its parent container cannot be guaranteed for this
    particular case. Ideally, we should put the provisioner directory
    under the container runtime dir, but this is not backward compatible.
    Unfortunately, we have to make provisioner::destroy() recursive.

    Review: https://reviews.apache.org/r/56808/
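
The recursive destroy described in the commit message can be illustrated
with a minimal, self-contained sketch. This is not the actual Mesos code:
the `Containers` map and the `destroyRecursive` function are hypothetical
stand-ins for the provisioner's bookkeeping, and the real implementation
is asynchronous and also removes root filesystems on disk. The sketch only
shows the ordering idea: nested containers are destroyed before their
parent, regardless of the order in which recovery visits them.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model: container ID -> parent ID (std::nullopt for a
// top-level container). Not the real provisioner data structure.
using Containers = std::map<std::string, std::optional<std::string>>;

// Destroys `id` after first destroying every container nested under it.
// Destroyed IDs are appended to `destroyed` in removal order.
void destroyRecursive(
    Containers& containers,
    const std::string& id,
    std::vector<std::string>& destroyed)
{
  if (containers.count(id) == 0) {
    return; // Already destroyed by an earlier recursive call.
  }

  // Collect direct children first so the map is not erased from while
  // it is being iterated.
  std::vector<std::string> children;
  for (const auto& entry : containers) {
    if (entry.second.has_value() && entry.second.value() == id) {
      children.push_back(entry.first);
    }
  }

  for (const std::string& child : children) {
    destroyRecursive(containers, child, destroyed);
  }

  // Analogue of the check in the reported log: by now no remaining
  // container may still name `id` as its parent.
  for (const auto& entry : containers) {
    assert(!entry.second.has_value() || entry.second.value() != id);
  }

  containers.erase(id);
  destroyed.push_back(id);
}
```

With this shape, destroying container 1 while its nested container 1.2
still exists no longer trips the check, because 1.2 is removed first.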

> The agent may be flapping after the machine reboots due to provisioner 
> recover.
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-7152
>                 URL: https://issues.apache.org/jira/browse/MESOS-7152
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Gilbert Song
>            Assignee: Gilbert Song
>            Priority: Blocker
>              Labels: nested, provisioner
>
> After the agent machine reboots, if the agent work dir survives (e.g., 
> /var/lib/mesos) and the container runtime directory is gone (so SlaveState 
> is empty as well), the provisioner recover() crashes on a check failure, 
> because this case breaks the semantic that a child container should always 
> be cleaned up before its parent container.
> This is a particular case which only happens if the machine reboots and the 
> provisioner directory survives.
> {noformat}
> F0217 01:10:18.423238 30099 provisioner.cpp:504] Check failed: entry.parent() != containerId Failed to destroy container 1 since its nested container 1.2 has not been destroyed yet
> *** Check failure stack trace: ***
>     @     0x7fceb444121d  google::LogMessage::Fail()
>     @     0x7fceb44405ee  google::LogMessage::SendToLog()
>     @     0x7fceb4440eed  google::LogMessage::Flush()
>     @     0x7fceb4444368  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fceb36137f9  mesos::internal::slave::ProvisionerProcess::destroy()
>     @     0x7fceb36126f0  mesos::internal::slave::ProvisionerProcess::recover()
>     @     0x7fceb3637fc6  _ZZN7process8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS2_11ContainerIDESt4hashIS7_ESt8equal_toIS7_EESC_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESS_
>     @     0x7fceb3637bc2  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS6_11ContainerIDESt4hashISB_ESt8equal_toISB_EESG_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSN_FSL_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
>     @     0x7fceb43848e4  std::function<>::operator()()
>     @     0x7fceb436baf4  process::ProcessBase::visit()
>     @     0x7fceb43e5fde  process::DispatchEvent::visit()
>     @           0x9e4101  process::ProcessBase::serve()
>     @     0x7fceb4369007  process::ProcessManager::resume()
>     @     0x7fceb4377a8c  process::ProcessManager::init_threads()::$_2::operator()()
>     @     0x7fceb4377995  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
>     @     0x7fceb4377965  std::_Bind_simple<>::operator()()
>     @     0x7fceb437793c  std::thread::_Impl<>::_M_run()
>     @     0x7fceadefa030  (unknown)
>     @     0x7fcead70b6aa  start_thread
>     @     0x7fcead440e9d  (unknown)
> {noformat}
> The provisioner directory is supposed to be under the container runtime 
> directory. However, this is not backward compatible. We can only change it 
> after a deprecation cycle.
> For now, we have three options:
> 1. Make provisioner::destroy() recursive.
> 2. Sort the containers during recovery to guarantee the `child before parent` 
> semantic.
> 3. Remove the check failure, since the whole provisioner dir will be removed 
> eventually at the end (not recommended).
> Option (1) is recommended.
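
For comparison, option 2 from the list above (sorting containers so that
children come before parents) could look roughly like the sketch below.
It is only an illustration under stated assumptions: the `Containers` map,
`depth`, and `childBeforeParentOrder` are hypothetical names, not Mesos
APIs, and the parent links are assumed to be acyclic. Containers are
ordered by nesting depth, deepest first, so every child precedes its
parent in the resulting cleanup order.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model: container ID -> parent ID (std::nullopt for a
// top-level container). Parent links are assumed to be acyclic.
using Containers = std::map<std::string, std::optional<std::string>>;

// Nesting depth of `id`: 0 for a top-level container, otherwise the
// number of parent links followed until a top-level or unknown container
// is reached.
std::size_t depth(const Containers& containers, const std::string& id)
{
  std::size_t d = 0;
  auto it = containers.find(id);
  while (it != containers.end() && it->second.has_value()) {
    ++d;
    it = containers.find(it->second.value());
  }
  return d;
}

// Returns all container IDs ordered deepest first, so that a child always
// appears before its parent.
std::vector<std::string> childBeforeParentOrder(const Containers& containers)
{
  std::vector<std::string> ids;
  for (const auto& entry : containers) {
    ids.push_back(entry.first);
  }

  // stable_sort keeps containers of equal depth in their original order.
  std::stable_sort(
      ids.begin(),
      ids.end(),
      [&containers](const std::string& a, const std::string& b) {
        return depth(containers, a) > depth(containers, b);
      });

  return ids;
}
```

Destroying containers in this order satisfies the existing check without
changing destroy() itself, at the cost of an extra sorting pass during
recovery.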



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
