[
https://issues.apache.org/jira/browse/MESOS-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325265#comment-15325265
]
Greg Mann commented on MESOS-2105:
----------------------------------
We recently observed this on an internal test cluster. An executor was
OOM-killed before the cgroup mem isolator was able to destroy the offending
container. Here are the kernel logs from the agent machine interleaved with the
Mesos agent logs:
{code}
Jun 10 16:14:47 ip-10-10-0-87 mesos-slave[3038]: I0610 16:14:47.434166 3044
mem.cpp:644] OOM detected for container d9d84892-1165-43a2-9675-10b88be141f4
Jun 10 16:14:47 ip-10-10-0-87 kernel: docker0: port 1(vethb30b136) entered
forwarding state
Jun 10 16:14:47 ip-10-10-0-87 kernel: balloon-executo invoked oom-killer:
gfp_mask=0xd0, order=0, oom_score_adj=0
Jun 10 16:14:47 ip-10-10-0-87 kernel: balloon-executo cpuset=/ mems_allowed=0
Jun 10 16:14:47 ip-10-10-0-87 kernel: CPU: 2 PID: 23924 Comm: balloon-executo
Tainted: G ------------ T 3.10.0-327.10.1.el7.x86_64 #1
Jun 10 16:14:47 ip-10-10-0-87 kernel: Hardware name: Xen HVM domU, BIOS
4.2.amazon 05/12/2016
Jun 10 16:14:47 ip-10-10-0-87 kernel: ffff8803a6463980 000000009a29939c
ffff88025f85bcd0 ffffffff816352cc
Jun 10 16:14:47 ip-10-10-0-87 kernel: ffff88025f85bd60 ffffffff8163026c
ffff8802ec7265b8 0000000000000001
Jun 10 16:14:47 ip-10-10-0-87 kernel: ffffffff00000003 fffeefff00000000
0000000000000001 ffff8803a6467803
Jun 10 16:14:47 ip-10-10-0-87 kernel: Call Trace:
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff816352cc>] dump_stack+0x19/0x1b
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff8163026c>]
dump_header+0x8e/0x214
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff8116cebe>]
oom_kill_process+0x24e/0x3b0
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff81088dae>] ?
has_capability_noaudit+0x1e/0x30
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff811d3b65>]
mem_cgroup_oom_synchronize+0x555/0x580
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff811d2f50>] ?
mem_cgroup_charge_common+0xc0/0xc0
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff8116d734>]
pagefault_out_of_memory+0x14/0x90
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff8162e69c>]
mm_fault_error+0x68/0x12b
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff816411d2>]
__do_page_fault+0x3e2/0x450
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff81641263>]
do_page_fault+0x23/0x80
Jun 10 16:14:47 ip-10-10-0-87 kernel: [<ffffffff8163d4c8>] page_fault+0x28/0x30
Jun 10 16:14:47 ip-10-10-0-87 kernel: Task in
/mesos/d9d84892-1165-43a2-9675-10b88be141f4 killed as a result of limit of
/mesos/d9d84892-1165-43a2-9675-10b88be141f4
Jun 10 16:14:47 ip-10-10-0-87 kernel: memory: usage 196608kB, limit 196608kB,
failcnt 50
Jun 10 16:14:47 ip-10-10-0-87 kernel: memory+swap: usage 196608kB, limit
9007199254740991kB, failcnt 0
Jun 10 16:14:47 ip-10-10-0-87 kernel: kmem: usage 0kB, limit
9007199254740991kB, failcnt 0
Jun 10 16:14:47 ip-10-10-0-87 kernel: Memory cgroup stats for
/mesos/d9d84892-1165-43a2-9675-10b88be141f4: cache:0KB rss:196608KB
rss_huge:188416KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:
Jun 10 16:14:47 ip-10-10-0-87 kernel: [ pid ] uid tgid total_vm rss
nr_ptes swapents oom_score_adj name
Jun 10 16:14:47 ip-10-10-0-87 kernel: [23886] 0 23886 2378 288
11 0 0 sh
Jun 10 16:14:47 ip-10-10-0-87 kernel: [23914] 0 23914 223240 52827
159 0 0 balloon-executo
Jun 10 16:14:47 ip-10-10-0-87 kernel: Memory cgroup out of memory: Kill process
23924 (balloon-executo) score 1045 or sacrifice child
Jun 10 16:14:47 ip-10-10-0-87 kernel: Killed process 23914 (balloon-executo)
total-vm:892960kB, anon-rss:196168kB, file-rss:15140kB
Jun 10 16:14:47 ip-10-10-0-87 mesos-slave[3038]: I0610 16:14:47.600641 3043
slave.cpp:3788] executor(1)@10.10.0.87:37878 exited
{code}
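As a rough aid for reading logs like the above, here is a minimal sketch that correlates the agent's "OOM detected" line with the kernel's cgroup kill report. The function and pattern names are hypothetical, and the regular expressions are derived only from the lines quoted in this comment, so other kernel or Mesos versions may format these messages differently.

```python
import re

# Hypothetical patterns, based only on the log lines quoted in this comment.
CONTAINER_ID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
AGENT_OOM = re.compile(r"OOM detected for container (" + CONTAINER_ID + r")")
KERNEL_KILL = re.compile(
    r"Task in /mesos/(" + CONTAINER_ID + r") killed as a result of limit"
)
KERNEL_USAGE = re.compile(r"memory: usage (\d+)kB, limit (\d+)kB, failcnt (\d+)")


def summarize_oom(log_text):
    """Collect container IDs and memory counters from interleaved logs."""
    # The pasted logs are hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    usage = [tuple(int(n) for n in m) for m in KERNEL_USAGE.findall(flat)]
    return {
        # Containers for which the agent logged an OOM detection.
        "agent_detected": set(AGENT_OOM.findall(flat)),
        # Containers whose cgroup the kernel named in a kill report.
        "kernel_killed": set(KERNEL_KILL.findall(flat)),
        # (usage_kb, limit_kb, failcnt) from the plain "memory:" lines only.
        "usage_kb": usage,
    }
```

A container that appears in `kernel_killed` but not in `agent_detected` would be one where the kernel acted without the agent logging the OOM.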
> Reliably report OOM even if the executor exits normally
> -------------------------------------------------------
>
> Key: MESOS-2105
> URL: https://issues.apache.org/jira/browse/MESOS-2105
> Project: Mesos
> Issue Type: Improvement
> Components: isolation
> Affects Versions: 0.20.0
> Reporter: Ian Downes
>
> Container OOMs are asynchronously reported by the kernel and the following
> sequence can occur:
> 1) Container OOMs
> 2) Kernel chooses to kill the task
> 3) Executor notices, reports TASK_FAILED, then exits
> 4) MesosContainerizer sees executor exit, *doesn't check for an OOM*, and
> destroys the container
> 5) Memory isolator may or may not have seen the OOM event but the container
> is destroyed anyway.
> The task is reported to have failed but without including the cause.
> Suggest always checking if an OOM has occurred, even if the executor exits
> normally.
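The suggestion in the description can be sketched as follows. This is an illustrative model only, not the MesosContainerizer code: `Termination`, `on_executor_exit`, and the two callbacks are hypothetical names, and how the OOM check itself is performed is deliberately left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Termination:
    """Hypothetical record of why a container ended."""
    exit_status: int
    reason: Optional[str] = None


def on_executor_exit(
    container_id: str,
    exit_status: int,
    oom_occurred: Callable[[str], bool],
    destroy: Callable[[str], None],
) -> Termination:
    """Always check for an OOM before destroying, even on a normal exit."""
    # Query first: once the container is destroyed, its state may be gone.
    reason = "memory limit exceeded (OOM)" if oom_occurred(container_id) else None
    destroy(container_id)
    return Termination(exit_status=exit_status, reason=reason)
```

With this ordering, an executor that exits with status 0 after reporting TASK_FAILED would still have the OOM recorded as the cause of the termination.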