Ian Downes created MESOS-1689:
---------------------------------
Summary: Race with kernel to kill process / destroy cgroup after OOM
Key: MESOS-1689
URL: https://issues.apache.org/jira/browse/MESOS-1689
Project: Mesos
Issue Type: Bug
Affects Versions: 0.20.0
Reporter: Ian Downes
The recently refactored cgroup::destroy code can fail to freeze a freezer
cgroup under a particular ordering of events, as detailed below. If the freeze
does not complete, the LinuxLauncher fails to destroy the cgroup and the other
isolators cannot destroy their cgroups either.
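For reference, the destroy path referred to here is roughly: freeze the
freezer cgroup, SIGKILL every task in it, thaw so the signals are delivered,
wait for the tasks to exit, then remove the cgroup. A minimal sketch against
the cgroup v1 filesystem interface follows; the helper names and error
handling are illustrative, not Mesos's actual cgroups code.
{noformat}
// Illustrative sketch only, not Mesos's actual cgroups code: the destroy
// sequence against the cgroup v1 freezer filesystem interface.
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#include <fstream>
#include <string>

// Hypothetical helper: write a value to a control file inside the cgroup.
static bool writeControl(const std::string& cgroup,
                         const std::string& file,
                         const std::string& value) {
  std::ofstream out(cgroup + "/" + file);
  out << value;
  return static_cast<bool>(out);
}

// Hypothetical destroy sequence: freeze, SIGKILL all tasks, thaw, remove.
bool destroyCgroup(const std::string& cgroup) {
  // 1. Freeze so no task can fork while the tasks are being killed. The
  //    caller then has to wait until freezer.state actually reports FROZEN;
  //    that wait is the step that never completes in the race described
  //    below.
  if (!writeControl(cgroup, "freezer.state", "FROZEN")) {
    return false;
  }

  // 2. SIGKILL every task listed in cgroup.procs.
  std::ifstream procs(cgroup + "/cgroup.procs");
  pid_t pid;
  while (procs >> pid) {
    kill(pid, SIGKILL);
  }

  // 3. Thaw so the pending signals are delivered and the tasks can exit.
  if (!writeControl(cgroup, "freezer.state", "THAWED")) {
    return false;
  }

  // 4. Once cgroup.procs is empty, the (leaf) cgroup directory can be
  //    removed.
  return rmdir(cgroup.c_str()) == 0;
}
{noformat}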
This failure is logged but otherwise ignored by a running slave. If the slave
is subsequently restarted, it blocks on the cgroup::destroy during launcher
recovery, which times out after 60 seconds; recovery then fails and the slave
terminates. If the slave is monitored and automatically restarted, it will
flap repeatedly.
The problem appears as a container freezer cgroup that will not transition to
FROZEN and remains in FREEZING. This is because one or more processes cannot be
frozen.
{noformat}
[idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
FREEZING
[idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs | xargs ps -L
PID LWP TTY STAT TIME COMMAND
29369 29369 ? Dsl 0:02 python2.6 ./thermos_executor
29369 29482 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29483 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29484 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29485 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29486 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29487 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29488 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29489 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29490 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29491 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29492 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29493 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29494 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29495 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29496 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29497 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29498 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29499 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29500 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29582 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29583 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29584 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29585 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29604 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29605 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29606 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29607 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29608 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29610 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29612 ? Dsl 0:00 python2.6 ./thermos_executor
29369 29613 ? Dsl 0:00 python2.6 ./thermos_executor
29526 29526 ? D 0:02 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
29578 29578 ? Ds 29:49 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
29603 29603 ? R 254:08 python2.6 /usr/local/bin/package_cache 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r
{noformat}
Debugging with [~jieyu] indicates the following sequence of events:
1. Cgroup reaches memory limit
2. Kernel notifies Mesos that an OOM condition has occurred.
3. Mesos initiates freezing the cgroup.
4. There is now a race in the kernel between freezing the cgroup's processes
and deciding which process to kill. The kernel does check that a process is
suitable for killing and will thaw it if necessary, but there is a window
between this check and actually signaling the process. The selected process
can be frozen inside that window, before the signal is delivered, so it
neither dies nor releases its memory.
{noformat}
[idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
Aug 8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
Aug 8 01:52:05 s_all@hostname kernel: [804284.985019] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
[idownes@hostname ~]$ cat /proc/29578/status | grep State
State: D (disk sleep)
[idownes@hostname ~]$ cat /proc/29578/status | grep SigPnd
SigPnd: 0000000000000100
[idownes@hostname ~]$ sudo cat /proc/29578/stack
[sudo] password for idownes:
[<ffffffff8107b17d>] __refrigerator+0xc9/0x125
[<ffffffff8104b1fe>] try_to_freeze+0x30/0x32
[<ffffffff8104d8e5>] get_signal_to_deliver+0x50/0x4ba
[<ffffffff81002217>] do_signal+0x49/0x5ec
[<ffffffff810027e3>] do_notify_resume+0x29/0x65
[<ffffffff814abd12>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
{noformat}
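In that output, SigPnd is a hex bitmask of pending signals where bit N-1
stands for signal N, so 0000000000000100 (bit 8) means SIGKILL (signal 9) is
queued but not yet delivered while the task sits in __refrigerator. A
standalone decoder, purely for illustration:
{noformat}
// Illustrative sketch only: decode the SigPnd mask from /proc/<pid>/status.
// Bit (N - 1) of the hex mask corresponds to signal N, so 0x100 (bit 8)
// means signal 9, i.e. SIGKILL, is pending but has not been delivered.
#include <signal.h>

#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  std::ifstream status("/proc/" + std::string(argv[1]) + "/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.compare(0, 7, "SigPnd:") == 0) {
      uint64_t mask = std::strtoull(line.c_str() + 7, nullptr, 16);
      bool pending = mask & (1ULL << (SIGKILL - 1));
      std::cout << "SIGKILL pending: " << (pending ? "yes" : "no")
                << std::endl;
    }
  }
  return 0;
}
{noformat}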
5. Mesos continues to wait for the cgroup to freeze, polling on freezer.state.
6. Meanwhile, another process in the cgroup tries to allocate a page. The
resulting page fault cannot be satisfied because the cgroup is at its limit
(the selected process has not been killed, so no memory has been freed). This
process keeps faulting, bouncing between runnable/running and uninterruptible
sleep, and never gets frozen.
{noformat}
[idownes@hostname ~]$ sudo cat /proc/29603/stack
[<ffffffff81065246>] __cond_resched+0x26/0x30
[<ffffffff81125abd>] vmpressure+0x3b/0x7c
[<ffffffff81125b18>] vmpressure_prio+0x1a/0x1c
[<ffffffff810f02ce>] do_try_to_free_pages+0x71/0x31c
[<ffffffff810f0748>] try_to_free_mem_cgroup_pages+0xb0/0x10f
[<ffffffff81122d3d>] mem_cgroup_reclaim+0x4d/0xb2
[<ffffffff811231ba>] __mem_cgroup_try_charge+0x2a9/0x8ff
[<ffffffff81123c52>] mem_cgroup_charge_common+0x35/0x5d
[<ffffffff81124dba>] mem_cgroup_newpage_charge+0x3a/0x3c
[<ffffffff810fe48d>] handle_pte_fault+0x66f/0x749
[<ffffffff810ff3e4>] handle_mm_fault+0xf0/0x123
[<ffffffff814a8102>] __do_page_fault+0x1b5/0x35f
[<ffffffff814a82ba>] do_page_fault+0xe/0x10
[<ffffffff814a53f2>] page_fault+0x22/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
{noformat}
7. Freezing the cgroup cannot progress. After 60 seconds the slave gives up and
the destroy is discarded.
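The polling in step 5 and the timeout in step 7 amount to reading
freezer.state until it reports FROZEN or a deadline passes. A minimal sketch
of that check, assuming the cgroup v1 freezer interface (the function name
and the 100ms poll interval are illustrative):
{noformat}
// Illustrative sketch only: poll freezer.state until FROZEN or a deadline.
#include <chrono>
#include <fstream>
#include <string>
#include <thread>

// Returns true if the cgroup reports FROZEN before the timeout expires. In
// the scenario above it returns false after 60 seconds because the state is
// stuck at FREEZING.
bool waitForFrozen(const std::string& cgroup, std::chrono::seconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    std::ifstream in(cgroup + "/freezer.state");
    std::string state;
    in >> state;  // FROZEN, FREEZING or THAWED
    if (state == "FROZEN") {
      return true;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  return false;
}

// e.g. waitForFrozen("/sys/fs/cgroup/freezer/mesos/<container-id>",
//                    std::chrono::seconds(60));
{noformat}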
The cgroup can be unstuck by either a) killing the faulting process, which
lets the cgroup fully freeze, or b) thawing the cgroup, which delivers the
SIGKILL to the selected process and frees memory so the faulting process can
continue and the next freeze attempt succeeds.
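Workaround (b) is just a freezer state transition on the stuck cgroup; a
hedged sketch, using the cgroup path from this report and an illustrative
helper:
{noformat}
// Illustrative sketch only: workaround (b) -- thaw the stuck cgroup so the
// pending SIGKILL is delivered, then retry the freeze.
#include <fstream>
#include <string>

// Hypothetical helper: request a freezer state transition.
static bool setFreezerState(const std::string& cgroup,
                            const std::string& state) {
  std::ofstream out(cgroup + "/freezer.state");
  out << state;
  return static_cast<bool>(out);
}

int main() {
  const std::string cgroup =
      "/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a";

  // Thawing lets the kernel deliver the queued SIGKILL; the victim exits and
  // releases its memory, so the other tasks stop faulting.
  setFreezerState(cgroup, "THAWED");

  // A subsequent freeze can now complete, unblocking cgroup destruction.
  setFreezerState(cgroup, "FROZEN");

  return 0;
}
{noformat}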