[ https://issues.apache.org/jira/browse/MESOS-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649199#comment-13649199 ]
Benjamin Mahler commented on MESOS-461:
---------------------------------------
For a fix here, I'd like to improve the logging and increase the amount of time
we allow for the cgroup to become empty.
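Roughly what I have in mind for the wait, as a sketch only (the path layout, timeout, and helper name here are placeholders, not what the isolation module actually uses): poll the cgroup's 'tasks' file for longer before giving up, and log which pids are still holding it open on each attempt.

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Poll a cgroup v1 'tasks' file until it is empty or the timeout expires,
// logging the remaining pids on every attempt.
bool waitForEmptyCgroup(
    const std::string& cgroupPath,
    std::chrono::seconds timeout = std::chrono::seconds(60),
    std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  int attempt = 0;

  while (std::chrono::steady_clock::now() < deadline) {
    ++attempt;

    // Read the pids still attached to the cgroup.
    std::ifstream tasks(cgroupPath + "/tasks");
    std::vector<std::string> pids;
    std::string pid;
    while (std::getline(tasks, pid)) {
      pids.push_back(pid);
    }

    if (pids.empty()) {
      std::cout << "Cgroup " << cgroupPath << " became empty after "
                << attempt << " attempts" << std::endl;
      return true;
    }

    // Improved logging: show which pids are still in the cgroup.
    std::cout << "Attempt " << attempt << ": " << pids.size()
              << " process(es) remain in " << cgroupPath << ":";
    for (const std::string& p : pids) {
      std::cout << " " << p;
    }
    std::cout << std::endl;

    std::this_thread::sleep_for(interval);
  }

  return false; // Timed out; the caller decides whether to fail or re-kill.
}
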
I'd also like to verify whether this comment in the Freezer actually holds while
the cgroup is in the FREEZING state, since that was the source of the failure:
// We don't need to worry about the race condition here as it is not
// possible to add new processes into this cgroup or remove processes from
// this cgroup when freezer.state is in FREEZING state.
foreach (pid_t pid, pids.get()) {
  Try<proc::ProcessStatistics> stat = proc::stat(pid);
  if (stat.isError()) {
    promise.fail("Failed to get process statistics: " + stat.error());
    terminate(self());
    return;
  }
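If that comment turns out to be wrong and pids can still leave the cgroup while freezer.state is FREEZING, one option would be to treat a vanished /proc/<pid>/stat as "the process already exited" rather than failing the whole destroy. A sketch with plain POSIX calls instead of the stout/proc helpers ('StatResult' and 'tryReadStat' are made-up names for illustration):

#include <cerrno>
#include <cstdio>
#include <string>

#include <sys/types.h>

enum class StatResult { OK, EXITED, ERROR };

// Read /proc/<pid>/stat, distinguishing "already exited" from a real failure.
StatResult tryReadStat(pid_t pid, std::string* contents)
{
  const std::string path = "/proc/" + std::to_string(pid) + "/stat";

  FILE* file = std::fopen(path.c_str(), "r");
  if (file == nullptr) {
    // ENOENT (or ESRCH) means the process is already gone, which is benign
    // while the cgroup is being torn down; anything else is a real error.
    return (errno == ENOENT || errno == ESRCH)
      ? StatResult::EXITED
      : StatResult::ERROR;
  }

  char buffer[4096];
  size_t length = std::fread(buffer, 1, sizeof(buffer) - 1, file);
  std::fclose(file);

  buffer[length] = '\0';
  contents->assign(buffer, length);
  return StatResult::OK;
}

In the loop above, an EXITED result would simply be skipped instead of failing the promise and terminating, so a racy exit would not bring down the slave while we wait for the cgroup to empty.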
> Freezer invoked multiple times for the same cgroup.
> ---------------------------------------------------
>
> Key: MESOS-461
> URL: https://issues.apache.org/jira/browse/MESOS-461
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
>
> I kept the logs for a second executor 'gc' as well, in case they are relevant.
> An OOM is triggered for executor E1, after which we freeze its cgroup successfully.
> I0504 20:48:44.481727 44475 cgroups_isolation_module.cpp:819] OOM notifier is
> triggered for executor E1 of framework F with tag T
> I0504 20:48:44.571171 44475 cgroups_isolation_module.cpp:863] OOM detected
> for executor E1 of framework F with tag T
> I0504 20:48:44.573693 44475 cgroups_isolation_module.cpp:889] MEMORY LIMIT:
> 1707081728 bytes
> MEMORY USAGE: 1707081728 bytes
> MEMORY STATISTICS:
> cache 229376
> rss 1706852352
> mapped_file 16384
> pgpgin 4722471
> pgpgout 4305703
> inactive_anon 0
> active_anon 1706713088
> inactive_file 102400
> active_file 53248
> unevictable 0
> hierarchical_memory_limit 1707081728
> total_cache 229376
> total_rss 1706852352
> total_mapped_file 16384
> total_pgpgin 4722471
> total_pgpgout 4305703
> total_inactive_anon 0
> total_active_anon 1706713088
> total_inactive_file 102400
> total_active_file 53248
> total_unevictable 0
> I0504 20:48:44.573789 44475 cgroups_isolation_module.cpp:534] Killing
> executor E1 of framework F
> I0504 20:48:44.578125 44471 cgroups.cpp:1146] Trying to freeze cgroup
> /cgroup/mesos/F_executor_E1_tag_UUID
> I0504 20:48:46.009472 44464 slave.cpp:830] Status update: task
> system-gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F is now in state
> TASK_FINISHED
> I0504 20:48:46.014714 44464 slave.cpp:727] Got acknowledgement of status
> update for task system-gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F
> I0504 20:48:46.015807 44470 cgroups_isolation_module.cpp:571] Changing cgroup
> controls for executor gc-30cd1abb-84cf-4a12-8c1b-13396a27867f of framework F
> with resources cpus=0.19; disk=15; mem=127
> I0504 20:48:46.024636 44470 cgroups_isolation_module.cpp:676] Updated
> 'cpu.shares' to 194 for executor gc of framework F
> I0504 20:48:46.043144 44470 cgroups_isolation_module.cpp:774] Updated
> 'memory.soft_limit_in_bytes' to 133169152 for executor gc of framework F
> I0504 20:48:51.829247 44472 cgroups_isolation_module.cpp:633] Telling slave
> of terminated executor gc of framework F
> I0504 20:48:51.830458 44472 cgroups_isolation_module.cpp:534] Killing
> executor gc of framework F
> I0504 20:48:51.830574 44462 slave.cpp:1053] Executor 'gc' of framework F has
> exited with status 0
> I0504 20:48:51.846076 44472 cgroups_isolation_module.cpp:819] OOM notifier is
> triggered for executor gc of framework F with tag T
> I0504 20:48:51.846158 44472 cgroups_isolation_module.cpp:824] Discarded OOM
> notifier for executor gc of framework F with tag T
> I0504 20:48:51.859122 44472 cgroups.cpp:1146] Trying to freeze cgroup
> /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.859297 44472 cgroups.cpp:1185] Successfully froze cgroup
> /cgroup/mesos/framework_F_executor_gc_tag_T after 1 attempts
> I0504 20:48:51.893443 44473 cgroups.cpp:1161] Trying to thaw cgroup
> /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.893656 44473 cgroups.cpp:1268] Successfully thawed
> /cgroup/mesos/framework_F_executor_gc_tag_T
> I0504 20:48:51.946413 44473 cgroups_isolation_module.cpp:903] Successfully
> destroyed the cgroup mesos/framework_F_executor_gc_tag_T
> I0504 20:49:05.155886 44463 cgroups.cpp:1185] Successfully froze cgroup
> /cgroup/mesos/framework_F_executor_E1_tag_UUID after 5 attempts
> I0504 20:49:05.181869 44472 cgroups.cpp:1161] Trying to thaw cgroup
> /cgroup/mesos/framework_F_executor_E1_tag_UUID
> I0504 20:49:05.195574 44472 cgroups.cpp:1268] Successfully thawed
> /cgroup/mesos/framework_F_executor_E1_tag_UUID
> At this point, the cgroup is frozen and thawed, and we then try to freeze the
> cgroup again!
> I0504 20:49:11.364585 44468 cgroups.cpp:1146] Trying to freeze cgroup
> /cgroup/mesos/framework_F_executor_E1_tag_UUID
> F0504 20:49:11.687748 44468 cgroups_isolation_module.cpp:905] Failed to
> destroy the cgroup mesos/framework_F_executor_E1_tag_UUID: Failed to kill
> tasks in nested cgroups: Collect failed: Failed to get process statistics:
> Failed to open '/proc/33071/stat'
> *** Check failure stack trace: ***
> @ 0x7fb4c489169d google::LogMessage::Fail()
> @ 0x7fb4c4897307 google::LogMessage::SendToLog()
> @ 0x7fb4c4892f4c google::LogMessage::Flush()
> @ 0x7fb4c48931b6 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fb4c46431f0
> mesos::internal::slave::CgroupsIsolationModule::destroyWaited()
> @ 0x7fb4c4653f77 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7fb4c4656c24 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7fb4c478cd7b std::tr1::function<>::operator()()
> @ 0x7fb4c4745f9f process::ProcessBase::visit()
> @ 0x7fb4c475aec8 process::DispatchEvent::visit()
> @ 0x7fb4c474f11d process::ProcessManager::resume()
> @ 0x7fb4c474f968 process::schedule()
> @ 0x7fb4c3e6873d start_thread
> @ 0x7fb4c284cf6d clone
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira