Currently, CloudStack deals with primary storage (PS) failure by rebooting all hosts associated with the cluster. Selectively cleaning up the affected VMs would have been the best option, but since issues were seen with stopping VMs on the hypervisors (at least in XenServer 5.6 [1]), rebooting the hosts was the next best option. The downside of this approach is that if the cluster has more than one PS, healthy VMs are unnecessarily affected by the host reboots.
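
To make that downside concrete, here is a minimal sketch comparing the two approaches. This is not CloudStack code: the `Vm` record and both function names are made up for illustration, and it assumes each VM keeps all of its volumes on a single PS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    """Hypothetical record: a VM, the host it runs on, and the PS holding its volumes."""
    name: str
    host: str
    storage_pool: str


def affected_by_host_reboot(vms):
    """Current approach: every host in the cluster reboots, so every VM goes down."""
    return {vm.name for vm in vms}


def affected_by_selective_cleanup(vms, failed_pool):
    """Selective approach: only VMs whose volumes live on the failed PS are touched."""
    return {vm.name for vm in vms if vm.storage_pool == failed_pool}


# A two-host cluster with two primary storage pools; ps-1 is the one that fails.
cluster = [
    Vm("vm-a", "host-1", "ps-1"),
    Vm("vm-b", "host-1", "ps-2"),
    Vm("vm-c", "host-2", "ps-2"),
]

if __name__ == "__main__":
    print("host reboot:      ", sorted(affected_by_host_reboot(cluster)))
    print("selective cleanup:", sorted(affected_by_selective_cleanup(cluster, "ps-1")))
```

With ps-1 down, the reboot path takes out all three VMs, while the selective path only touches vm-a; vm-b and vm-c on the healthy ps-2 keep running.
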
Recently I tried this scenario on both XS 6.1 and 6.2. On 6.1 I think the behaviour is similar to 5.6: if the PS is not available, any operation on the VM, such as shutdown, hangs (I waited for more than 30 minutes and the operation was still stuck). On 6.2, however, it looks like these scenarios are handled more gracefully. On shutdown, the VM's power state changed to 'halted', and it was then even possible to destroy the VM.

Based on this, I think that at least for XS 6.2 we can do a selective VM cleanup if the PS is not available. For older XS versions the existing approach would still be used. Thoughts/comments?

The same approach is also used for KVM. Can someone let me know whether newer versions of KVM handle primary storage failure better with respect to VM operations? If so, the behaviour can be changed for KVM as well. For VMware, since it is an externally managed cluster, I don't think this issue exists.

Thanks,
Koushik

[1] https://issues.apache.org/jira/browse/CLOUDSTACK-3367
[2] http://comments.gmane.org/gmane.comp.apache.cloudstack.user/4254
