[ https://issues.apache.org/jira/browse/MESOS-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dominic Hamon updated MESOS-1689:
---------------------------------
    Sprint: Twitter Q4 Sprint 2

> Race with kernel to kill process / destroy cgroup after OOM
> -----------------------------------------------------------
>
>                 Key: MESOS-1689
>                 URL: https://issues.apache.org/jira/browse/MESOS-1689
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Ian Downes
>
> The recently refactored cgroup::destroy code can fail to freeze a freezer 
> cgroup under a particular ordering of events, as detailed below. When this 
> happens the LinuxLauncher fails to destroy the cgroup, and the other 
> isolators are then unable to destroy their cgroups.
> This failure will be logged but otherwise ignored by a running slave. If the 
> slave is subsequently restarted, it will block on the cgroup::destroy during 
> launcher recovery, time out after 60 seconds, and fail recovery, causing the 
> slave to terminate. If the slave is monitored and automatically restarted, it 
> will flap repeatedly.
> The problem manifests as a container freezer cgroup that remains in FREEZING 
> and never transitions to FROZEN, because one or more of its processes cannot 
> be frozen.
> {noformat}
> [idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
> FREEZING
> [idownes@hostname ~]$ cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/cgroup.procs | xargs ps -L
>   PID   LWP TTY      STAT   TIME COMMAND
> 29369 29369 ?        Dsl    0:02 python2.6 ./thermos_executor
> 29369 29482 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29483 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29484 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29485 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29486 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29487 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29488 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29489 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29490 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29491 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29492 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29493 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29494 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29495 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29496 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29497 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29498 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29499 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29500 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29582 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29583 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29584 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29585 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29604 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29605 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29606 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29607 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29608 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29610 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29612 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29369 29613 ?        Dsl    0:00 python2.6 ./thermos_executor
> 29526 29526 ?        D      0:02 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
> 29578 29578 ?        Ds    29:49 python2.6 /var/lib/mesos/slaves/20140729-023029-1890854154-5050-33440-3/frameworks/20110
> 29603 29603 ?        R    254:08 python2.6 /usr/local/bin/package_cache 719808749561cd7e77d8a22df9f36643 hftp://hadoop-r
> {noformat}
> Debugging with [~jieyu] indicates the following sequence of events:
> 1. The cgroup reaches its memory limit.
> 2. The kernel notifies Mesos that an OOM condition has occurred.
> 3. Mesos initiates freezing the cgroup (a sketch of the freeze/poll sequence 
> follows the stack trace below).
> 4. There is now a race in the kernel between freezing the cgroup's processes 
> and the OOM killer signaling its selected victim. The kernel does check that 
> the selected process is suitable for killing and will thaw it if necessary, 
> but there is a window between this check and actually signaling the process. 
> The selected process can therefore be frozen before the SIGKILL is delivered, 
> so it does not die and its memory is not released.
> {noformat}
> [idownes@hostname ~]$ grep 'Kill process 29578' /var/log/kern.log
> Aug  8 01:52:05 s_all@hostname kernel: [804284.982630] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
> Aug  8 01:52:05 s_all@hostname kernel: [804284.985019] Memory cgroup out of memory: Kill process 29578 (python2.6) score 413 or sacrifice child
> [idownes@hostname ~]$ cat /proc/29578/status | grep State
> State:  D (disk sleep)
> [idownes@hostname ~]$ cat /proc/29578/status | grep SigPnd
> SigPnd: 0000000000000100
> [idownes@hostname ~]$ sudo cat /proc/29578/stack
> [sudo] password for idownes:
> [<ffffffff8107b17d>] __refrigerator+0xc9/0x125
> [<ffffffff8104b1fe>] try_to_freeze+0x30/0x32
> [<ffffffff8104d8e5>] get_signal_to_deliver+0x50/0x4ba
> [<ffffffff81002217>] do_signal+0x49/0x5ec
> [<ffffffff810027e3>] do_notify_resume+0x29/0x65
> [<ffffffff814abd12>] int_signal+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff
> {noformat}
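> For reference, the freeze that steps 3 and 5 refer to is driven through the 
> cgroup v1 freezer interface. A minimal illustrative shell sketch of the 
> equivalent write-and-poll sequence (the slave performs these reads and writes 
> internally; the path is the one from this report, and this would need root):
> {noformat}
> cg=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a
> # Request the freeze (step 3).
> echo FROZEN > "$cg/freezer.state"
> # Poll until the kernel reports FROZEN (step 5). In this bug the state never
> # leaves FREEZING because the OOM-selected process is already frozen with the
> # SIGKILL still pending.
> while [ "$(cat "$cg/freezer.state")" != "FROZEN" ]; do sleep 0.1; done
> {noformat}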
> 5. Mesos continues to wait for the cgroup to freeze, polling on freezer.state.
> 6. Meanwhile, another process in the cgroup tries to allocate a page. This 
> triggers a page fault which cannot be satisfied because the cgroup is at its 
> limit (the selected process has not been killed, so no memory has been 
> freed). This process keeps faulting, repeatedly bouncing between 
> runnable/running and uninterruptible sleep, and never gets frozen (a sampling 
> sketch follows its stack trace below).
> {noformat}
> [idownes@hostname ~]$ sudo cat /proc/29603/stack
> [<ffffffff81065246>] __cond_resched+0x26/0x30
> [<ffffffff81125abd>] vmpressure+0x3b/0x7c
> [<ffffffff81125b18>] vmpressure_prio+0x1a/0x1c
> [<ffffffff810f02ce>] do_try_to_free_pages+0x71/0x31c
> [<ffffffff810f0748>] try_to_free_mem_cgroup_pages+0xb0/0x10f
> [<ffffffff81122d3d>] mem_cgroup_reclaim+0x4d/0xb2
> [<ffffffff811231ba>] __mem_cgroup_try_charge+0x2a9/0x8ff
> [<ffffffff81123c52>] mem_cgroup_charge_common+0x35/0x5d
> [<ffffffff81124dba>] mem_cgroup_newpage_charge+0x3a/0x3c
> [<ffffffff810fe48d>] handle_pte_fault+0x66f/0x749
> [<ffffffff810ff3e4>] handle_mm_fault+0xf0/0x123
> [<ffffffff814a8102>] __do_page_fault+0x1b5/0x35f
> [<ffffffff814a82ba>] do_page_fault+0xe/0x10
> [<ffffffff814a53f2>] page_fault+0x22/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> {noformat}
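> The flapping described in step 6 can be observed by sampling the process 
> state, e.g. (illustrative only; 29603 is the faulting process from this 
> report):
> {noformat}
> # Alternates between R (running) and D (disk sleep) as the process keeps faulting.
> while true; do grep State /proc/29603/status; sleep 0.1; done
> {noformat}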
> 7. Freezing the cgroup cannot progress. After 60 seconds the slave gives up 
> and the destroy is discarded.
> The cgroup can be unstuck by any of the following:
> a) killing the faulting process, which lets the cgroup fully freeze
> b) THAWing the cgroup, which delivers the pending SIGKILL to the selected 
> process and frees memory so the faulting process can continue and the next 
> freeze attempt succeeds (a sketch follows the listing below)
> c) increasing the memory cgroup's limit so the page fault can allocate a page
> {noformat}
> [idownes@hostname ~]$ cat /sys/fs/cgroup/memory/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/memory.limit_in_bytes
> 167772160
> [root@hostname idownes]# echo 167782160 > /sys/fs/cgroup/memory/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/memory.limit_in_bytes
> [root@hostname idownes]# cat /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state
> cat: /sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a/freezer.state: No such file or directory
> {noformat}
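> Similarly, option (b) amounts to thawing the freezer cgroup so the pending 
> SIGKILL is finally delivered, then re-attempting the freeze. A minimal 
> illustrative sketch (same path as above, run as root):
> {noformat}
> cg=/sys/fs/cgroup/freezer/mesos/4c6c0bb9-fd1e-4468-9e1d-30ef383ad84a
> # Thaw: the OOM-selected process resumes, receives the pending SIGKILL and
> # exits, freeing memory so the faulting process can make progress.
> echo THAWED > "$cg/freezer.state"
> # A subsequent freeze attempt should now reach FROZEN.
> echo FROZEN > "$cg/freezer.state"
> {noformat}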



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
