On 09/16/2013 01:21 PM, Koushik Das wrote:
Currently, the way CloudStack deals with a primary storage (PS) failure is to
reboot all hosts associated with the cluster. Selectively cleaning up the
affected VMs would have been the best option, but since issues were seen with
stopping VMs on the hypervisors (at least in XenServer 5.6 [1]), reboot was the
next option. The downside of this approach is that if there is more than one PS
in the cluster, healthy VMs are unnecessarily affected by the host reboots.
Recently I tried this scenario using both XS 6.1 and 6.2. On 6.1 I think the
behaviour is similar to 5.6: if the PS is not available, any operation on the
VM, like shutdown, hangs (I waited for more than 30 minutes and the operation
was still stuck). But on 6.2 it looks like these scenarios are handled more
gracefully. In 6.2, on doing a shutdown, the VM's power state was changed to
'halted', and it was then even possible to destroy the VM. Based on this, I
think that at least for XS 6.2 we can do a selective VM cleanup if the PS is
not available. For older XS versions the existing approach would still be used.
Thoughts/comments?
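To make the proposal concrete, here is a minimal sketch of the decision it describes: selective VM cleanup on XS 6.2 and later, host reboot on older versions. It is only an illustration under stated assumptions. The names `Vm`, `plan_recovery`, and the action strings are hypothetical and are not CloudStack or XenServer APIs, and the 6.2 threshold simply mirrors the observation above.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Vm(NamedTuple):
    """Hypothetical record: a VM and the primary storage pools its disks use."""
    name: str
    storage_pools: Tuple[str, ...]


# Assumption taken from the test above: 6.2 is the first XS version where
# stopping a VM on a failed PS does not hang.
SELECTIVE_CLEANUP_MIN_XS = (6, 2)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "6.2" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def plan_recovery(xs_version: str, failed_pool: str,
                  vms: Iterable[Vm]) -> Tuple[str, List[str]]:
    """Return (action, affected VM names) for a primary storage failure."""
    vms = list(vms)
    if parse_version(xs_version) >= SELECTIVE_CLEANUP_MIN_XS:
        # Newer XS: only touch VMs that actually have a disk on the failed PS.
        affected = [vm.name for vm in vms if failed_pool in vm.storage_pools]
        return ("cleanup_vms", affected)
    # Older XS: existing behaviour, a host reboot hits every VM on the host.
    return ("reboot_host", [vm.name for vm in vms])
```

With two VMs on different pools, only the one on the failed pool is touched on 6.2, while both are affected on 6.1.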
The same approach is also used for KVM. Can someone let me know if newer
versions of KVM can handle primary storage failure in a better way with respect
to VM operations? In that case the behaviour can be changed for KVM as well.
I can't comment on this specifically, but when you are using NFS your
Qemu process will go into state "D" (uninterruptible sleep) and can't be killed.
So that leaves only one other option: reboot the host.
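As a rough illustration of how the "D" state can be detected on a Linux host, the sketch below reads the state field from `/proc/<pid>/stat`. The helper names `state_from_stat` and `is_uninterruptible` are made up for this example, and how the Qemu PID is found is left out.

```python
def state_from_stat(stat_line: str) -> str:
    """Extract the process state letter from a /proc/<pid>/stat line.

    The line looks like "1234 (qemu-kvm) D 1 ...". The command name may itself
    contain spaces or parentheses, so split after the last ")".
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def is_uninterruptible(pid: int) -> bool:
    """True if the process is in state "D" (uninterruptible sleep)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return state_from_stat(stat_file.read()) == "D"
```

A process in this state ignores signals, including SIGKILL, until the blocked I/O completes, which is why the check can only report the condition and not fix it.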
With NFS though, you can stop the NFS server and bring it back 2 hours
later, and with KVM all the VMs will recover within 15 minutes without any
intervention.
CS shouldn't start doing all kinds of things when PS fails.
Wido
For VMware, since it is an externally managed cluster, I don't think this issue
exists.
Thanks,
Koushik
[1] https://issues.apache.org/jira/browse/CLOUDSTACK-3367
[2] http://comments.gmane.org/gmane.comp.apache.cloudstack.user/4254