Hi Dev!
We experienced a serious customer outage caused by the OOM killer on one member
of a redundant router VM pair. Somehow the MASTER node ran out of memory and
the OOM killer killed random processes, which took HAProxy down. But since
keepalived was still running and functioning, a failover never happened.
In our experience it is better to panic on OOM than to pray that the OOM killer
will do the right thing, because in 99% of cases it just renders the machine
useless.
If this RvR had panicked and rebooted, we would have had a clean keepalived
failover without much impact on our customer.
So we decided to configure the following sysctl options:
vm.panic_on_oom = 1
kernel.panic_on_oops = 1
kernel.panic = 10
With these settings a VM panics on OOM (or on a kernel oops) and reboots 10
seconds later, so a router comes back in a healthy state instead of being
crippled by the OOM killer.
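
For anyone who wants to verify that a router is actually running with these
values, here is a minimal sketch in Python. It assumes the settings are exposed
under /proc/sys in the usual way (dots in the key become path separators). The
function names are ours and purely illustrative; nothing here is part of
CloudStack.

```python
from pathlib import Path

# The three settings quoted above.
WANTED = {
    "vm.panic_on_oom": "1",
    "kernel.panic_on_oops": "1",
    "kernel.panic": "10",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'vm.panic_on_oom' to its file under root."""
    return Path(root, *key.split("."))


def check(wanted=WANTED, root="/proc/sys"):
    """Return {key: (wanted, actual)} for every setting that differs.

    'actual' is None when the file cannot be read on this system.
    """
    mismatches = {}
    for key, want in wanted.items():
        try:
            actual = sysctl_path(key, root).read_text().strip()
        except OSError:
            actual = None
        if actual != want:
            mismatches[key] = (want, actual)
    return mismatches


if __name__ == "__main__":
    for key, (want, actual) in check().items():
        print(f"{key}: want {want}, found {actual}")
```

Run on a router, it prints one line per setting that does not match and nothing
when all three are in place.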
But we hit a problem here with VPC routers: their configuration does not
persist when they are rebooted outside CloudStack, because they are not
configured (entirely) through kernel parameters (/var/cache/cloud/cmdline).
They are only fully configured when started by CloudStack.
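
To make the comparison with the other system VMs concrete, here is a rough
sketch of reading kernel-style key=value parameters from
/var/cache/cloud/cmdline and reporting which expected ones are missing. We are
assuming the file holds a single space-separated command line. The parameter
names used below ("type", "name") are hypothetical placeholders, not a
confirmed list of what CloudStack passes to a router.

```python
import shlex


def parse_cmdline(text):
    """Split a kernel-style command line into a dict.

    'key=value' tokens become entries; bare flags map to True.
    """
    params = {}
    for token in shlex.split(text):
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def missing_params(params, required):
    """Return the sorted list of required keys absent from params."""
    return sorted(key for key in required if key not in params)


if __name__ == "__main__":
    try:
        with open("/var/cache/cloud/cmdline") as fh:
            params = parse_cmdline(fh.read())
    except OSError:
        params = {}  # file absent, e.g. when not run on a system VM
    # Hypothetical list of parameters a router would need at boot.
    print(missing_params(params, ["type", "name"]))
```

If everything a VPC router needs were present in that file, a check like this
would come back empty after a reboot outside CloudStack as well.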
It would be nice if the VPC router configuration persisted across reboots,
even when rebooted outside CloudStack, using the same mechanism as the other
system VMs, to make things more consistent and reliable.
What is your opinion on this? Otherwise we will add it to our backlog to
contribute improvements in this area.
See also:
https://issues.apache.org/jira/browse/CLOUDSTACK-4605
https://issues.apache.org/jira/browse/CLOUDSTACK-4606
https://issues.apache.org/jira/browse/CLOUDSTACK-4607
Thanks & Cheers,
Roeland Kuipers