Re: Primary store (PS) failure/outage

Ove Ewerlid Mon, 16 Sep 2013 07:22:06 -0700

The PS behaviour of NFS hits an install with predominantly local storage for production badly. Why use NFS primary storage at all in such an install? The SSVM will not start in a zone with only local storage; an additional type of primary storage needs to be added, and a dummy primary NFS share is easy to add.

A primary "dummy" NFS share can be removed after the SSVM has started, to avoid issues with the dummy NFS share, but that adds extra hassle when restarting the SSVM.


Does anyone know a configuration-oriented solution to this?

In this case the environment is KVM/OEL64.

/Ove

On 09/16/2013 01:54 PM, Wido den Hollander wrote:

On 09/16/2013 01:21 PM, Koushik Das wrote:

Currently the way CloudStack deals with PS failure is to reboot all
hosts associated with the cluster. Selectively cleaning up the
affected VMs would have been the best option, but since issues were
seen with stopping VMs on the hypervisors (at least in XenServer 5.6
[1]) reboot was the next option. The downside of this approach is
that if there is more than one PS in the cluster then healthy VMs will
unnecessarily get affected by the host reboots.

Recently I tried this scenario using both XS 6.1 and 6.2. On 6.1 I
think the behaviour is similar to 5.6: if the PS is not available then
any operation on the VM, like shutdown, would hang (waited for more than
30 mins and the operation was still stuck). But on 6.2 it looks like
these scenarios are handled more gracefully. In 6.2, on doing a shutdown
the VM's power state was changed to 'halted' and then it was possible to
even destroy the VM. Based on this I think that at least for XS 6.2 we
can do a selective VM cleanup if the PS is not available. For older XS
versions the existing approach would still be used.
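
To make the proposal above concrete, here is a minimal sketch of the decision it describes. It is not CloudStack code: the function names, the action strings and the hypervisor label are hypothetical, and treating every XS version from 6.2 upwards as "graceful" is an assumption (only 6.2 was actually tried above).

```python
def parse_version(version):
    """Turn a dotted version string such as '6.2' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def ps_failure_action(hypervisor, version):
    """Pick a reaction to a primary storage failure (hypothetical policy).

    Assumption: XenServer 6.2 and later can halt and destroy VMs whose
    primary storage is gone, so only the affected VMs need cleaning up.
    Everything else keeps the existing behaviour of rebooting the hosts.
    """
    if hypervisor == "XenServer" and parse_version(version) >= (6, 2):
        return "cleanup_affected_vms"
    return "reboot_hosts"


if __name__ == "__main__":
    for hv, ver in [("XenServer", "5.6"), ("XenServer", "6.1"), ("XenServer", "6.2")]:
        print(hv, ver, ps_failure_action(hv, ver))
```

Comparing versions as tuples rather than strings avoids surprises with multi-digit components; whether KVM could join the "selective cleanup" branch is exactly the open question asked below.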

Thoughts/comments?

Also for KVM the same approach is used. Can someone let me know if
newer versions of KVM can handle primary storage failure in a better
way wrt VM operations? In that case the behaviour can be changed for
KVM as well.


I can't comment on this specifically, but when you are using NFS your
Qemu process will go into state "D" (uninterruptible sleep) and can't
be killed.

So that will lead to the only other option: reboot the host.
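
For anyone wanting to check whether a host is in that situation, the following is a small sketch that lists processes currently in state "D" by reading `/proc/<pid>/stat` on Linux. The function names are made up for illustration; the only assumption about the system is the standard procfs layout, where the state letter is the first field after the parenthesised command name.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' instead of on whitespace.
    """
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """List PIDs whose state is 'D' (uninterruptible sleep)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                state = parse_state(stat_file.read())
        except (OSError, IndexError):
            continue  # process exited while scanning, or unreadable entry
        if state == "D":
            pids.append(int(entry))
    return pids


if __name__ == "__main__":
    print(uninterruptible_pids())
```

A process that shows up here only briefly is normal I/O wait; a Qemu process that stays in the list while the NFS server is down is the case described above.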

With NFS though, you can stop the NFS server and bring it back 2 hours
later and with KVM all the VMs will recover within 15 min without any
intervention.

CS shouldn't start doing all kinds of things when PS fails.

Wido

For VMware, since it is an externally managed cluster, I don't think
this issue exists.

Thanks,
Koushik

[1] https://issues.apache.org/jira/browse/CLOUDSTACK-3367
[2] http://comments.gmane.org/gmane.comp.apache.cloudstack.user/4254



--
Ove Everlid
System Administrator / Architect / SDN- & Automation- & Linux-hacker
Mobile: +46706662363 (dedicated work mobile)
Country: Sweden, timezone: Middle European Time (MET or GMT+1)
