[
https://issues.apache.org/jira/browse/MESOS-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661164#comment-13661164
]
Vinod Kone commented on MESOS-473:
----------------------------------
According to
https://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt, failing
to write 'FROZEN' to freezer.state with EBUSY is expected behavior. So, instead
of failing fatally, we should retry the write. From the documentation:
It's important to note that freezing can be incomplete. In that case we return
EBUSY. This means that some tasks in the cgroup are busy doing something that
prevents us from completely freezing the cgroup at this time. After EBUSY,
the cgroup will remain partially frozen -- reflected by freezer.state reporting
"FREEZING" when read. The state will remain "FREEZING" until one of these
things happens:
1) Userspace cancels the freezing operation by writing "THAWED" to
the freezer.state file
2) Userspace retries the freezing operation by writing "FROZEN" to
the freezer.state file (writing "FREEZING" is not legal
and returns EINVAL)
3) The tasks that blocked the cgroup from entering the "FROZEN"
state disappear from the cgroup's set of tasks.
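A minimal sketch of the retry described above, assuming a bounded number of
attempts with a fixed delay between them. The names freezeWithRetry and
writeFrozenTo, the attempt limit and the delay are hypothetical and are not
taken from the Mesos code base; the actual fix may look different.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: performs one attempt to write "FROZEN" to the given
// freezer.state file. Returns 0 on success, or the errno value on failure.
int writeFrozenTo(const std::string& path)
{
  int fd = ::open(path.c_str(), O_WRONLY);
  if (fd < 0) {
    return errno;
  }

  const char value[] = "FROZEN";
  ssize_t written = ::write(fd, value, sizeof(value) - 1);
  int error = (written < 0) ? errno : 0;

  ::close(fd);
  return error;
}

// Hypothetical helper: retries the write while it fails with EBUSY, which the
// kernel documentation describes as an incomplete (partial) freeze.
// Returns 0 once a write succeeds, EBUSY if every attempt was busy, the first
// non-EBUSY errno otherwise, or EINVAL if maxAttempts is not positive.
int freezeWithRetry(
    const std::function<int()>& writeFrozen,
    int maxAttempts,
    std::chrono::milliseconds delay)
{
  int error = EINVAL;

  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    error = writeFrozen();
    if (error != EBUSY) {
      return error; // Success (0) or an error that a retry will not fix.
    }

    // Give the busy tasks time to become freezable before retrying.
    if (attempt + 1 < maxAttempts) {
      std::this_thread::sleep_for(delay);
    }
  }

  return error;
}
```

A caller could bind the path and retry a few times, for example
freezeWithRetry([&]() { return writeFrozenTo(path); }, 10,
std::chrono::milliseconds(100)), and cancel the freeze by writing "THAWED"
if EBUSY is still returned after the last attempt.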
> Freezer fails fatally when it is unable to write 'FROZEN' to freezer.state
> --------------------------------------------------------------------------
>
> Key: MESOS-473
> URL: https://issues.apache.org/jira/browse/MESOS-473
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.10.0, 0.11.0, 0.12.0, 0.13.0
> Reporter: Vinod Kone
> Assignee: Vinod Kone
> Fix For: 0.13.0
>
>
> Observed this when running tests in a loop. The failing test was
> SlaveRecoveryTest.RecoverTerminatedExecutor.
> F0517 22:40:00.163806 9004 cgroups_isolator.cpp:1165] Failed to destroy cgroup mesos_test/framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8: Failed to kill tasks in nested cgroups: Collect failed: Failed to write control 'freezer.state': Device or resource busy
> *** Check failure stack trace: ***
> @ 0x7facb0d080ed google::LogMessage::Fail()
> @ 0x7facb0d0dd57 google::LogMessage::SendToLog()
> @ 0x7facb0d0999c google::LogMessage::Flush()
> @ 0x7facb0d09c06 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7facb0a96837 mesos::internal::slave::CgroupsIsolator::_killExecutor()
> @ 0x7facb0aaa6b0 std::tr1::_Mem_fn<>::operator()()
> @ 0x7facb0aabdce std::tr1::_Bind<>::operator()<>()
> @ 0x7facb0aabdfd std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7facb0ab1043 std::tr1::function<>::operator()()
> @ 0x7facb0ab875e process::internal::vdispatcher<>()
> @ 0x7facb0ab9b98 std::tr1::_Bind<>::operator()<>()
> @ 0x7facb0ab9bed std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7facb0c09059 std::tr1::function<>::operator()()
> @ 0x7facb0bcf54d process::ProcessBase::visit()
> @ 0x7facb0be43ca process::DispatchEvent::visit()
> @ 0x5fcd90 process::ProcessBase::serve()
> @ 0x7facb0bd8e3d process::ProcessManager::resume()
> @ 0x7facb0bd9688 process::schedule()
> @ 0x7facafcb473d start_thread
> @ 0x7facae698f6d clone
> The tasks in the cgroup are all either in uninterruptible sleep ('D') or
> traced ('T'):
> [vinod@smfd-bkq-03-sr4 framework_201305172240-1740121354-46893-8981-0000_executor_59f49d23-9b61-4d08-868c-87af1b06a019_tag_8be5f3f8-e0ce-40d6-83dc-9866a984cbb8]$ cat tasks | xargs ps -F -p
> UID PID PPID C SZ RSS PSR STIME TTY STAT TIME CMD
> root 25761 1 0 91854 15648 4 22:39 ? Dl 0:00 /home/vinod/mesos/build/src/.libs/lt-mesos-executor
> root 25802 25761 0 14734 544 13 22:39 ? Ts 0:00 sleep 1000
> root 25804 25761 0 15961 1296 7 22:39 ? D 0:00 /bin/bash /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> root 25814 25804 0 15961 224 14 22:39 ? D 0:00 /bin/bash /home/vinod/mesos/build/../src/scripts/killtree.sh -p 25802 -s 15 -g -x -v
> gdb hangs when trying to attach to the mesos executor, likely because it is
> in the 'D' state.