[
https://issues.apache.org/jira/browse/MESOS-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754389#comment-13754389
]
Eric W. Biederman commented on MESOS-662:
-----------------------------------------
On average we already have a 0.5s delay.
Right now the kernel sends the OOM notification before any processes are
killed. So if we have noticed that the process has exited, we have already
received the OOM notification. The only reason we would not have processed
the OOM notification is if the events get processed out of order by the
libprocess framework. That is possible, at least in principle, since
different POSIX threads are looking at the different events, and in my
playing with the code base I have seen it happen in practice.
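To make the race concrete, here is a contrived standalone C++ sketch (not
Mesos code; the names, flags, and sleep are made up) of two threads watching
two independent event sources, where the observer sees the exit before the
OOM notification even though the OOM came first:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Flags standing in for the two independent event sources.
std::atomic<bool> oomNotified(false);
std::atomic<bool> processExited(false);

int main() {
  // Thread A delivers the kernel's OOM notification, slightly late.
  std::thread oomListener([] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    oomNotified = true;
  });

  // Thread B reaps the executor immediately.
  std::thread reaper([] { processExited = true; });
  reaper.join();

  // The exit is observed while the OOM notification is still pending,
  // even though the OOM happened first.
  if (processExited && !oomNotified) {
    std::cout << "exit seen before OOM notification" << std::endl;
  }

  oomListener.join();
  return 0;
}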
That said, I think all we would need to make it 100% reliable that we have
seen the OOM notification is to replace:
// Stop the OOM listener if needed.
if (info->oomNotifier.isPending()) {
  info->oomNotifier.discard();
}
with something like:
// Force the OOM notifier to run so we know if an OOM happened.
if (info->oomNotifier.isPending()) {
  info->oomNotifier.await();
}
possibly preceded by a trip through the event loop so that we force the
event to be delivered.
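For concreteness, here is a sketch of what that cleanup path could look
like, assuming libprocess's Future::await() blocks until the future leaves
the pending state (the timeout value below is illustrative, not from any
actual patch):
if (info->oomNotifier.isPending()) {
  // Block until the in-flight OOM notification (if any) has been
  // delivered, so a real OOM is not misreported as a normal exit.
  if (!info->oomNotifier.await(Seconds(5))) {
    // Nothing arrived within the timeout; treat it as no OOM.
    info->oomNotifier.discard();
  }
}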
> Executor OOM could lead to a kernel hang
> ----------------------------------------
>
> Key: MESOS-662
> URL: https://issues.apache.org/jira/browse/MESOS-662
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Benjamin Mahler
> Priority: Critical
> Fix For: 0.15.0
>
>
> We observed this in production at Twitter.
> An executor OOMed and the kernel put it to sleep instead of killing it,
> because the Mesos slave disables kernel OOM kills. Mesos disables the kernel
> OOM killer so that it can take some action itself. Currently the only action
> it takes is cleaning up the cgroup, but in the future the action could be to
> increase the memory limit.
> [6290807.554028] SysRq : Show Blocked State
> [6290807.554175] task PC stack pid father
> [6290807.554251] python2.6 D ffff88097b1c3158 0 31039 1 0x00000000
> [6290807.554255] ffff88120ae19b48 0000000000000082 0000000000000000 ffff88093ffffa08
> [6290807.554259] ffff88093fffed00 ffff88120ae18010 0000000000013300 0000000000013300
> [6290807.554263] 0000000000013300 ffff88120ae19fd8 0000000000013300 0000000000013300
> [6290807.554267] Call Trace:
> [6290807.554279] [<ffffffff814dfabd>] schedule+0x64/0x66
> [6290807.554285] [<ffffffff8113ad09>] mem_cgroup_handle_oom+0x132/0x21f
> [6290807.554289] [<ffffffff81138e62>] ? mem_cgroup_update_tree+0x165/0x165
> [6290807.554292] [<ffffffff8113aef5>] mem_cgroup_do_charge+0xff/0x124
> [6290807.554295] [<ffffffff8113b0ce>] __mem_cgroup_try_charge+0x1b4/0x298
> [6290807.554298] [<ffffffff8113b643>] mem_cgroup_charge_common+0x6a/0x91
> [6290807.554301] [<ffffffff8113b72f>] mem_cgroup_newpage_charge+0x23/0x25
> [6290807.554307] [<ffffffff8110c26e>] do_anonymous_page+0x169/0x29a
> [6290807.554311] [<ffffffff81110137>] handle_pte_fault+0x8d/0x1b1
> [6290807.554315] [<ffffffff8110a793>] ? anon_vma_interval_tree_insert+0x8a/0x8c
> [6290807.554319] [<ffffffff81113afe>] ? vma_adjust+0x50f/0x5b9
> [6290807.554324] [<ffffffff811a196d>] ? ext3_dx_readdir+0x181/0x1d7
> [6290807.554327] [<ffffffff81110489>] handle_mm_fault+0x22e/0x248
> [6290807.554332] [<ffffffff814e3c6a>] do_page_fault+0x367/0x3ae
> [6290807.554335] [<ffffffff811149f4>] ? do_brk+0x291/0x2f2
> [6290807.554339] [<ffffffff81141289>] ? __fput+0x1e7/0x1f6
> [6290807.554342] [<ffffffff814e0ba5>] page_fault+0x25/0x30
> A short term solution is to enable the kernel OOM killer in cgroups (until
> we get around to adding support for soft memory limits in the cgroups
> isolator). The slave should still get an OOM notification and properly
> inform the frameworks of the OOM. One concern is that we don't know whether
> the kernel handling the OOM would cause problems with the cgroup cleanup
> done by the slave.
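For reference, the knobs the description refers to are cgroup v1's
memory.oom_control and cgroup.event_control files. A minimal sketch
(hypothetical cgroup path; error handling omitted) of re-enabling the kernel
OOM killer while still receiving OOM notifications through an eventfd:
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main() {
  // Hypothetical cgroup path, for illustration only.
  const char* cgroup = "/sys/fs/cgroup/memory/mesos/executor";
  char path[256], buf[64];

  // Writing "0" re-enables the kernel OOM killer for this cgroup;
  // Mesos had written "1" (disable), which left OOMing tasks blocked
  // in the kernel and produced the hang above.
  snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup);
  int oc = open(path, O_WRONLY);
  (void)write(oc, "0", 1);
  close(oc);

  // Register an eventfd so the slave still learns about OOMs even
  // though the kernel now does the killing.
  int efd = eventfd(0, 0);
  int ofd = open(path, O_RDONLY);  // path is still memory.oom_control.
  snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
  snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup);
  int cfd = open(path, O_WRONLY);
  (void)write(cfd, buf, strlen(buf));
  close(cfd);

  uint64_t count;
  (void)read(efd, &count, sizeof(count));  // Blocks until an OOM occurs.
  printf("OOM events: %llu\n", (unsigned long long)count);

  close(ofd);
  close(efd);
  return 0;
}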