[ https://issues.apache.org/jira/browse/MESOS-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler reassigned MESOS-662:
-------------------------------------
Assignee: Benjamin Mahler
> Executor OOM could lead to a kernel hang
> ----------------------------------------
>
> Key: MESOS-662
> URL: https://issues.apache.org/jira/browse/MESOS-662
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Benjamin Mahler
> Priority: Critical
> Fix For: 0.15.0
>
>
> We observed this in production at Twitter.
> An executor OOMed and the kernel put it to sleep instead of killing it,
> because the Mesos slave disables kernel OOM kills. Mesos disables the kernel
> OOM killer so that it can take some action itself. Currently, the only action
> it takes is cleaning up the cgroup. In the future, the action could be to
> increase the memory limit.
> [6290807.554028] SysRq : Show Blocked State
> [6290807.554175] task PC stack pid father
> [6290807.554251] python2.6 D ffff88097b1c3158 0 31039 1 0x00000000
> [6290807.554255] ffff88120ae19b48 0000000000000082 0000000000000000 ffff88093ffffa08
> [6290807.554259] ffff88093fffed00 ffff88120ae18010 0000000000013300 0000000000013300
> [6290807.554263] 0000000000013300 ffff88120ae19fd8 0000000000013300 0000000000013300
> [6290807.554267] Call Trace:
> [6290807.554279] [<ffffffff814dfabd>] schedule+0x64/0x66
> [6290807.554285] [<ffffffff8113ad09>] mem_cgroup_handle_oom+0x132/0x21f
> [6290807.554289] [<ffffffff81138e62>] ? mem_cgroup_update_tree+0x165/0x165
> [6290807.554292] [<ffffffff8113aef5>] mem_cgroup_do_charge+0xff/0x124
> [6290807.554295] [<ffffffff8113b0ce>] __mem_cgroup_try_charge+0x1b4/0x298
> [6290807.554298] [<ffffffff8113b643>] mem_cgroup_charge_common+0x6a/0x91
> [6290807.554301] [<ffffffff8113b72f>] mem_cgroup_newpage_charge+0x23/0x25
> [6290807.554307] [<ffffffff8110c26e>] do_anonymous_page+0x169/0x29a
> [6290807.554311] [<ffffffff81110137>] handle_pte_fault+0x8d/0x1b1
> [6290807.554315] [<ffffffff8110a793>] ? anon_vma_interval_tree_insert+0x8a/0x8c
> [6290807.554319] [<ffffffff81113afe>] ? vma_adjust+0x50f/0x5b9
> [6290807.554324] [<ffffffff811a196d>] ? ext3_dx_readdir+0x181/0x1d7
> [6290807.554327] [<ffffffff81110489>] handle_mm_fault+0x22e/0x248
> [6290807.554332] [<ffffffff814e3c6a>] do_page_fault+0x367/0x3ae
> [6290807.554335] [<ffffffff811149f4>] ? do_brk+0x291/0x2f2
> [6290807.554339] [<ffffffff81141289>] ? __fput+0x1e7/0x1f6
> [6290807.554342] [<ffffffff814e0ba5>] page_fault+0x25/0x30
> A short-term solution is to enable kernel OOM kills in cgroups (until we get
> around to adding support for soft memory limits in the cgroups isolator). The
> slave should still get an OOM notification and properly inform the frameworks
> of the OOM. One concern is that we don't know whether letting the kernel
> handle the OOM would cause problems with the cgroup cleanup done by the slave.
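
To illustrate the short-term fix described above, here is a minimal sketch. It is not Mesos code. It assumes the cgroup v1 memory controller, whose memory.oom_control file reports oom_kill_disable and under_oom, and where writing "0" re-enables the kernel OOM killer and "1" disables it. The function names and the cgroup directory argument are hypothetical, chosen only for this example.

```python
import os


def parse_oom_control(text):
    """Parse the 'key value' lines of a cgroup v1 memory.oom_control file."""
    state = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            state[parts[0]] = int(parts[1])
    return state


def is_stuck_under_oom(state):
    """True when the cgroup is under OOM while the kernel OOM killer is off.

    In that state, tasks that hit the limit sleep instead of being killed,
    which matches the blocked python2.6 task in the trace above.
    """
    return state.get("oom_kill_disable") == 1 and state.get("under_oom") == 1


def set_oom_kill_enabled(cgroup_dir, enabled):
    """Write memory.oom_control: "0" enables the kernel OOM killer, "1" disables it.

    cgroup_dir is a hypothetical path to the executor's memory cgroup.
    """
    path = os.path.join(cgroup_dir, "memory.oom_control")
    with open(path, "w") as f:
        f.write("0" if enabled else "1")
```

A caller could read memory.oom_control for the executor's cgroup, pass the contents to parse_oom_control, and call set_oom_kill_enabled(cgroup_dir, True) when is_stuck_under_oom returns true. Whether this interferes with the slave's own cgroup cleanup is the open concern raised in the issue.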