[jira] [Commented] (MESOS-473) Freezer fails fatally when it is unable to write 'FROZEN' to freezer.state

Vinod Kone (JIRA) Fri, 17 May 2013 17:01:17 -0700

    [ https://issues.apache.org/jira/browse/MESOS-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661164#comment-13661164 ]


Vinod Kone commented on MESOS-473:
----------------------------------

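According to 
https://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt, failing 
to write 'FROZEN' to freezer.state is expected behavior: the write returns EBUSY 
when freezing is incomplete. So, instead of failing fatally, we should retry the 
write. The relevant passage: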

It's important to note that freezing can be incomplete. In that case we return
EBUSY. This means that some tasks in the cgroup are busy doing something that
prevents us from completely freezing the cgroup at this time. After EBUSY,
the cgroup will remain partially frozen -- reflected by freezer.state reporting
"FREEZING" when read. The state will remain "FREEZING" until one of these
things happens:

        1) Userspace cancels the freezing operation by writing "THAWED" to
                the freezer.state file
        2) Userspace retries the freezing operation by writing "FROZEN" to
                the freezer.state file (writing "FREEZING" is not legal
                and returns EINVAL)
        3) The tasks that blocked the cgroup from entering the "FROZEN"
                state disappear from the cgroup's set of tasks.
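
The retry in option (2) above could be sketched roughly as follows. This is
only an illustration in Python, not the Mesos code: freeze_with_retry, its
parameters, and the cgroup_freezer_io helper are hypothetical names, and the
attempt limit and delay are arbitrary choices. Reading freezer.state back after
each attempt is an extra check added here so the loop does not rely on the
write's return value alone.

```python
import errno
import os
import time
from typing import Callable, Tuple


def freeze_with_retry(
    write_state: Callable[[str], None],
    read_state: Callable[[], str],
    max_attempts: int = 10,
    delay_secs: float = 0.1,
) -> bool:
    """Try to freeze a cgroup, retrying while the write fails with EBUSY.

    Returns True once freezer.state reads "FROZEN", or False if all attempts
    are used up. Errors other than EBUSY are re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            write_state("FROZEN")
        except OSError as e:
            if e.errno != errno.EBUSY:
                raise  # Anything other than EBUSY is a real failure.
        # The cgroup stays "FREEZING" until a later attempt succeeds.
        if read_state().strip() == "FROZEN":
            return True
        if attempt + 1 < max_attempts:
            time.sleep(delay_secs)
    return False


def cgroup_freezer_io(
    cgroup_dir: str,
) -> Tuple[Callable[[str], None], Callable[[], str]]:
    """Build write/read callables for <cgroup_dir>/freezer.state (hypothetical helper)."""
    path = os.path.join(cgroup_dir, "freezer.state")

    def write_state(value: str) -> None:
        with open(path, "w") as f:
            f.write(value)

    def read_state() -> str:
        with open(path) as f:
            return f.read()

    return write_state, read_state
```

A caller would pass the two callables from cgroup_freezer_io(cgroup_dir) to
freeze_with_retry and, if it returns False, could write "THAWED" to cancel the
freeze (option 1) rather than abort.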


                
> Freezer fails fatally when it is unable to write 'FROZEN' to freezer.state
> --------------------------------------------------------------------------
>
>                 Key: MESOS-473
>                 URL: https://issues.apache.org/jira/browse/MESOS-473
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 0.11.0, 0.12.0, 0.13.0
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>             Fix For: 0.13.0
>
>
> Observed this when running tests in a loop. The failing test was 
> SlaveRecoveryTest.RecoverTerminatedExecutor.
> F0517 22:40:00.163806  9004 cgroups_isolator.cpp:1165] Failed to destroy 
> cgroup 
> mesos_test/framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8:
>  Failed to kill tasks in nested cgroups: Collect failed: Failed to write 
> control 'freezer.state': Device or resource busy
> *** Check failure stack trace: ***
>     @     0x7facb0d080ed  google::LogMessage::Fail()
>     @     0x7facb0d0dd57  google::LogMessage::SendToLog()
>     @     0x7facb0d0999c  google::LogMessage::Flush()
>     @     0x7facb0d09c06  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7facb0a96837  mesos::internal::slave::CgroupsIsolator::_killExecutor()
>     @     0x7facb0aaa6b0  std::tr1::_Mem_fn<>::operator()()
>     @     0x7facb0aabdce  std::tr1::_Bind<>::operator()<>()
>     @     0x7facb0aabdfd  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7facb0ab1043  std::tr1::function<>::operator()()
>     @     0x7facb0ab875e  process::internal::vdispatcher<>()
>     @     0x7facb0ab9b98  std::tr1::_Bind<>::operator()<>()
>     @     0x7facb0ab9bed  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7facb0c09059  std::tr1::function<>::operator()()
>     @     0x7facb0bcf54d  process::ProcessBase::visit()
>     @     0x7facb0be43ca  process::DispatchEvent::visit()
>     @           0x5fcd90  process::ProcessBase::serve()
>     @     0x7facb0bd8e3d  process::ProcessManager::resume()
>     @     0x7facb0bd9688  process::schedule()
>     @     0x7facafcb473d  start_thread
>     @     0x7facae698f6d  clone
> The tasks in the cgroup are all either in uninterruptible sleep ('D') or 
> traced ('T'):
> [vinod@smfd-bkq-03-sr4 
> framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8]$
>  cat tasks | xargs ps -F -p
> UID        PID  PPID  C    SZ   RSS PSR STIME TTY      STAT   TIME CMD
> root     25761     1  0 91854 15648   4 22:39 ?        Dl     0:00 
> /home/vinod/mesos/build/src/.libs/lt-mesos-executor
> root     25802 25761  0 14734   544  13 22:39 ?        Ts     0:00 sleep 1000
> root     25804 25761  0 15961  1296   7 22:39 ?        D      0:00 /bin/bash 
> /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> root     25814 25804  0 15961   224  14 22:39 ?        D      0:00 /bin/bash 
> /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> gdb hangs when trying to attach to the mesos executor, likely because it's 
> in the 'D' state.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
