Sean Lair created CLOUDSTACK-10310:
--------------------------------------
Summary: KVM hosts reboot if there is a short transient storage error
Key: CLOUDSTACK-10310
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10310
Project: CloudStack
Issue Type: Improvement
Security Level: Public (Anyone can view this level - this is the default.)
Components: KVM
Affects Versions: 4.10.0.0, 4.9.0
Reporter: Sean Lair
If the KVM heartbeat file can't be written to, the host is rebooted, taking
down all VMs running on it. The code does try 5 times before the
reboot, but there is no delay between the retries, so they are 5
nearly simultaneous attempts, which doesn't help. Standard SAN storage HA
operations or a quick network blip could cause this reboot to occur.
Some discussions on the dev mailing list revealed that some people are just
commenting out the reboot line in their version of the CloudStack source.
A better option (and a new PR is being issued) would be to have it sleep between
tries so it isn't 5 almost simultaneous tries. Plus, instead of rebooting,
the cloudstack-agent could just be stopped on the host. This will
cause alerts to be issued and, if the host is disconnected long enough,
depending on the HA code in use, VM HA could handle the host failure.
The built-in reboot of the host seemed drastic.
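
The proposed behavior (sleep between tries, then stop the agent rather than reboot) can be sketched roughly as follows. This is only an illustration in Python, not the actual CloudStack code or the contents of the PR: the function name, the 60-second delay, and the two callbacks are assumptions made up for this example.

```python
import time
from typing import Callable


def write_heartbeat_with_retries(
    write_heartbeat: Callable[[], bool],   # hypothetical: returns True if the write succeeded
    on_failure: Callable[[], None],        # hypothetical: e.g. stop the agent instead of rebooting
    max_tries: int = 5,                    # the issue mentions 5 tries
    delay_seconds: float = 60.0,           # assumed value; the issue does not specify a delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the heartbeat write up to max_tries times, pausing between tries.

    Returns True as soon as one write succeeds. If every try fails,
    calls on_failure() once and returns False.
    """
    for attempt in range(1, max_tries + 1):
        if write_heartbeat():
            return True
        # Sleep only between tries, not after the final failed one.
        if attempt < max_tries:
            sleep(delay_seconds)
    on_failure()
    return False
```

With a delay between tries, a storage outage shorter than roughly (max_tries - 1) * delay_seconds no longer triggers the failure action at all, and passing an "agent stop" callback instead of a reboot keeps the running VMs up while the host shows as disconnected.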
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)