[DISCUSS] OOM killer and Routing/System VM's = :(

Roeland Kuipers Wed, 04 Sep 2013 11:25:58 -0700

Hi Dev!

We have experienced a serious customer outage due to the OOM killer on one 
member of a redundant routing VM pair. Somehow the MASTER node ran out of 
memory and the OOM killer decided to kill random processes, causing HAProxy to 
go down. But since keepalived was still running and functioning, a failover 
never happened. 
In our experience it is better to panic on OOM than to pray that the OOM killer 
will do the right thing, since in 99% of cases it just renders a machine 
useless. 
If this RvR had panicked and rebooted, we would have had a clean keepalived 
failover without much impact on our customer.


So we decided to configure the following sysctl options:
        vm.panic_on_oom = 1
        kernel.panic_on_oops = 1
        kernel.panic = 10

This way a VM panics on OOM and reboots 10 seconds later, so a router comes 
back in a healthy state instead of being left crippled by the OOM killer.
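
To verify that a running VM actually has these values, something like the 
following could be used. This is only a minimal sketch, assuming Python is 
available on the VM; the names check_sysctls and sysctl_path are made up for 
illustration and are not part of CloudStack. It reads the values from 
/proc/sys, where the kernel exposes each sysctl as a file.

```python
from pathlib import Path

# Settings proposed in this mail: sysctl name -> expected value.
WANTED = {
    "vm.panic_on_oom": "1",
    "kernel.panic_on_oops": "1",
    "kernel.panic": "10",
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file, e.g. /proc/sys/vm/panic_on_oom."""
    return Path(root).joinpath(*name.split("."))


def check_sysctls(wanted=WANTED, root="/proc/sys"):
    """Return {name: (expected, actual)} for every setting that differs.

    actual is None when the file cannot be read.
    """
    mismatches = {}
    for name, expected in wanted.items():
        try:
            actual = sysctl_path(name, root).read_text().strip()
        except OSError:
            actual = None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    # Print one line per setting that does not match; print nothing if all match.
    for name, (expected, actual) in check_sysctls().items():
        print(f"{name}: expected {expected}, found {actual}")
```

The root parameter is only there so the check can be pointed at a test 
directory instead of the real /proc/sys.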

But we hit a problem here with VPC routers: their configuration does not 
persist when they are rebooted outside CloudStack, because they are not 
configured (entirely) using kernel parameters (/var/cache/cloud/cmdline). 
Their configuration is only applied when they are started by CloudStack.

It would be nice if the VPC router config were persistent across reboots, even 
when rebooted outside CloudStack, using the same mechanism as the other system 
VMs to make things more consistent and reliable.

What is your opinion on this? Otherwise we will add it to our backlog to 
contribute improvements in this area.

See also:

https://issues.apache.org/jira/browse/CLOUDSTACK-4605
https://issues.apache.org/jira/browse/CLOUDSTACK-4606
https://issues.apache.org/jira/browse/CLOUDSTACK-4607


Thanks & Cheers,
Roeland Kuipers
