Kill entire containers on OOM with LXC isolation module
-------------------------------------------------------
Key: MESOS-47
URL: https://issues.apache.org/jira/browse/MESOS-47
Project: Mesos
Issue Type: Improvement
Components: isolation
Environment: Linux with container-based isolation
Reporter: Charles Reiss
When using the LXC isolation module, the kernel OOM killer kills a single
victim process in the container when the container exceeds its memory limit.
When the container runs multiple processes, the surviving processes keep
running without the victim, which can cause hard-to-diagnose failures.
Instead, Mesos should use the memory cgroup's oom_control feature to disable
OOM kills (processes requesting memory then block) and have the slave be
notified of OOM events through an eventfd. When the slave receives an OOM
notification on this eventfd, it should kill all processes in the over-limit
executor's container.
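
The mechanism described above could be sketched roughly as follows. This is
only an illustration, not the Mesos implementation: it assumes a cgroup v1
memory controller directory for the container, Python 3.10+ on Linux (for
os.eventfd), and all function names are hypothetical.

```python
import os
import signal


def disable_oom_killer(cgroup_dir):
    # cgroup v1: writing "1" to memory.oom_control disables the kernel OOM
    # killer for this cgroup; tasks that hit the limit block instead.
    with open(os.path.join(cgroup_dir, "memory.oom_control"), "w") as f:
        f.write("1")


def registration_line(event_fd, control_fd):
    # cgroup.event_control expects "<event_fd> <fd of the control file>".
    return f"{event_fd} {control_fd}"


def register_oom_listener(cgroup_dir):
    # Create an eventfd and register it for OOM notifications on the cgroup.
    event_fd = os.eventfd(0)
    control_fd = os.open(
        os.path.join(cgroup_dir, "memory.oom_control"), os.O_RDONLY)
    with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as f:
        f.write(registration_line(event_fd, control_fd))
    return event_fd, control_fd


def read_pids(cgroup_dir):
    # cgroup.procs lists one process ID per line.
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def kill_container(cgroup_dir, kill=os.kill):
    # Send SIGKILL to every process currently listed in the cgroup.
    # `kill` is injectable so the logic can be exercised without signals.
    pids = read_pids(cgroup_dir)
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited before we reached it
    return pids


def wait_for_oom_and_kill(cgroup_dir):
    # Block until the kernel signals an OOM event, then kill the container.
    disable_oom_killer(cgroup_dir)
    event_fd, control_fd = register_oom_listener(cgroup_dir)
    try:
        os.eventfd_read(event_fd)  # blocks until an OOM notification arrives
        return kill_container(cgroup_dir)
    finally:
        os.close(control_fd)
        os.close(event_fd)
```

A real slave would likely wait on the eventfd from its event loop rather than
block a thread, and would probably need to repeat the kill step until
cgroup.procs is empty, since processes may fork while the kill is in progress.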
(These OOM events only happen when a container exceeds its hard memory limit.
If Mesos overcommits memory in the future, it should have an outer cgroup with
a hard memory limit and memory.use_hierarchy enabled on which to receive OOM
events, so that they do not turn into global OOM kills. Mesos will then need
code to determine which executors are exceeding their "soft" memory limits and
to choose a victim executor.)
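
The issue does not say how a victim executor should be chosen. As one possible
policy, assumed here purely for illustration, the slave could pick the
executor that exceeds its soft limit by the largest amount. The function name
and the dictionary-based inputs are hypothetical.

```python
def choose_victim(usage, soft_limits):
    """Return the executor ID with the largest overage, or None.

    usage and soft_limits map executor ID -> bytes. Executors without a
    known soft limit, or at or under their limit, are never chosen.
    """
    overage = {
        executor: usage[executor] - soft_limits[executor]
        for executor in usage
        if executor in soft_limits and usage[executor] > soft_limits[executor]
    }
    if not overage:
        return None
    # Pick the executor that is furthest over its soft limit.
    return max(overage, key=overage.get)
```

Other policies (for example, relative overage or executor priority) would fit
the same interface.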