Ian Downes created MESOS-1689:
---------------------------------

             Summary: Race with kernel to kill process / destroy cgroup after OOM
                 Key: MESOS-1689
                 URL: https://issues.apache.org/jira/browse/MESOS-1689
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.20.0
            Reporter: Ian Downes


The recently refactored cgroup::destroy code can fail to freeze a freezer cgroup under a particular ordering of events, detailed below. When this happens the LinuxLauncher fails to destroy the container's cgroup and the other isolators are then unable to destroy their cgroups.
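
For context, the freeze that cgroup::destroy relies on is the standard freezer-cgroup interface: write FROZEN to freezer.state and poll the same file until it reports FROZEN. A minimal sketch of that interface (the container path is the illustrative one from the output below; this is the raw kernel interface, not the Mesos code):
{noformat}
# Sketch of the freezer interface the destroy path relies on (cgroup path is illustrative).
CG=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a

echo FROZEN > $CG/freezer.state   # request the freeze
cat $CG/freezer.state             # reads FREEZING until every task is frozen, then FROZEN

# Only once the cgroup reports FROZEN can the tasks be killed, the cgroup thawed,
# and the (now empty) cgroup removed with rmdir.
{noformat}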

This failure is logged but otherwise ignored by a running slave. If the slave is subsequently restarted it blocks on the cgroup::destroy during launcher recovery, times out after 60 seconds, and recovery fails, which terminates the slave. If the slave is monitored and automatically restarted it will flap repeatedly.

The problem appears as a container freezer cgroup that will not transition to 
FROZEN and remains in FREEZING. This is because one or more processes cannot be 
frozen.

{noformat}
[idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
FREEZING
[idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs | xargs ps -L
  PID   LWP TTY      STAT   TIME COMMAND
29369 29369 ?        Dsl    0:02 python2.6 ./thermos_executor
29369 29482 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29483 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29484 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29485 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29486 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29487 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29488 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29489 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29490 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29491 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29492 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29493 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29494 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29495 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29496 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29497 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29498 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29499 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29500 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29582 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29583 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29584 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29585 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29604 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29605 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29606 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29607 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29608 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29610 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29612 ?        Dsl    0:00 python2.6 ./thermos_executor
29369 29613 ?        Dsl    0:00 python2.6 ./thermos_executor
29526 29526 ?        D      0:02 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
29578 29578 ?        Ds    29:49 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
29603 29603 ?        R    254:08 python2.6 /usr/local/bin/package_cache 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r

{noformat}

Debugging with [~jieyu] indicates the following sequence of events:
1. The cgroup reaches its memory limit.

2. The kernel notifies Mesos that an OOM condition has occurred.

3. Mesos initiates freezing the cgroup.

4. There is now a race in the kernel between freezing the cgroup's processes and delivering the kill. The kernel does check that the selected victim is suitable for killing and will thaw it if necessary, but there is a window between this check and actually signaling the process. If the victim is frozen inside that window the SIGKILL stays pending and is never delivered, so the process does not die and does not release its memory.
{noformat}
[idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
Aug  8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
Aug  8 01:52:05 s_all@hostname kernel: [804284.985019] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child

[idownes@hostname ~]$ cat /proc/29578/status | grep State
State:  D (disk sleep)

[idownes@hostname ~]$ cat /proc/29578/status | grep SigPnd
SigPnd: 0000000000000100

[idownes@hostname ~]$ sudo cat /proc/29578/stack
[sudo] password for idownes:
[<ffffffff8107b17d>] __refrigerator+0xc9/0x125
[<ffffffff8104b1fe>] try_to_freeze+0x30/0x32
[<ffffffff8104d8e5>] get_signal_to_deliver+0x50/0x4ba
[<ffffffff81002217>] do_signal+0x49/0x5ec
[<ffffffff810027e3>] do_notify_resume+0x29/0x65
[<ffffffff814abd12>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
{noformat}

5. Mesos continues to wait for the cgroup to freeze, polling on freezer.state.

6. Meanwhile, another process in the cgroup tries to allocate a page. The resulting page fault cannot be satisfied because the cgroup is at its memory limit (the selected victim has not been killed, so no memory has been freed). This process keeps faulting, repeatedly bouncing between runnable/running and uninterruptible sleep, and so never gets frozen.
{noformat}
[idownes@hostname ~]$ sudo cat /proc/29603/stack
[<ffffffff81065246>] __cond_resched+0x26/0x30
[<ffffffff81125abd>] vmpressure+0x3b/0x7c
[<ffffffff81125b18>] vmpressure_prio+0x1a/0x1c
[<ffffffff810f02ce>] do_try_to_free_pages+0x71/0x31c
[<ffffffff810f0748>] try_to_free_mem_cgroup_pages+0xb0/0x10f
[<ffffffff81122d3d>] mem_cgroup_reclaim+0x4d/0xb2
[<ffffffff811231ba>] __mem_cgroup_try_charge+0x2a9/0x8ff
[<ffffffff81123c52>] mem_cgroup_charge_common+0x35/0x5d
[<ffffffff81124dba>] mem_cgroup_newpage_charge+0x3a/0x3c
[<ffffffff810fe48d>] handle_pte_fault+0x66f/0x749
[<ffffffff810ff3e4>] handle_mm_fault+0xf0/0x123
[<ffffffff814a8102>] __do_page_fault+0x1b5/0x35f
[<ffffffff814a82ba>] do_page_fault+0xe/0x10
[<ffffffff814a53f2>] page_fault+0x22/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
{noformat}

7. Freezing the cgroup cannot progress. After 60 seconds the slave gives up and 
the destroy is discarded.
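
For anyone debugging this on a box, a quick way to spot containers stuck in this state (assuming the freezer hierarchy is mounted at /sys/fs/cgroup/freezer as in the output above):
{noformat}
# List Mesos container freezer cgroups that are stuck mid-freeze.
grep -l FREEZING /sys/fs/cgroup/freezer/mesos/*/freezer.state

# The OOM-selected victim in such a cgroup shows SIGKILL pending but undelivered
# (the 0x100 bit in SigPnd, as in the output above) while parked in the refrigerator.
grep SigPnd /proc/29578/status
{noformat}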

The cgroup can be unstuck either by a) killing the faulting process, which lets the cgroup fully freeze, or b) THAWing the cgroup, which delivers the pending SIGKILL to the selected victim; that frees memory, the faulting process can make progress, and the next freeze attempt succeeds.
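
Concretely, with the cgroup and PIDs from the output above (illustrative), the two workarounds look like:
{noformat}
CG=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a

# a) kill the faulting process (29603 above); the remaining tasks can then freeze
kill -9 29603

# b) or thaw the cgroup so the pending SIGKILL reaches the OOM victim (29578 above),
#    then request the freeze again
echo THAWED > $CG/freezer.state
echo FROZEN > $CG/freezer.state
{noformat}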


