Hello guys

We’ve deployed an environment with CloudStack 4.11.2 and KVM (CentOS 7.6), with 
Ceph 13.2.5 as the primary storage.
We have run into an issue with the HA solution and would like to ask for your 
suggestions.

We’ve enabled both the VM HA and Host HA features in CloudStack, and the compute 
offering is tagged as ha.
When we perform a power failure test (unplugging 1 node of 4), the VMs that were 
running on the removed node are automatically rescheduled to the surviving nodes 
after 5 minutes, but none of them can boot into the OS. The boot process gets 
stuck on I/O read/write failures.



The following messages are shown after the VM starts:

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report

:/#



We found that this is caused by the exclusive lock on the image:
[root@cn01-nodea ~]# rbd lock list a93010b0-2be2-49bd-b25e-ec89b3a98b4b
There is 1 exclusive lock on this image.
Locker         ID                  Address
client.1164351 auto 94464726847232 10.226.16.128:0/3002249644

If we remove the lock from the image and restart the VM in CloudStack, the 
VM boots successfully.
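
For reference, the manual workaround described above can be sketched as a small 
shell helper. This is only an illustration: build_unlock_cmd is a hypothetical 
name, and the helper only prints the "rbd lock remove" command derived from the 
"rbd lock list" output so that it can be reviewed before being run. It assumes 
the three-column output format shown above, and that the image lives in the 
default pool as in our command. Removing a lock is only safe when the host that 
held it is really powered off.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ceph or CloudStack): read `rbd lock list`
# output on stdin and print the matching `rbd lock remove` command for review.
build_unlock_cmd() {
    image="$1"
    awk -v img="$image" '
        # Only the lock rows start with "client."; the summary and header
        # lines are skipped.
        /^client\./ {
            locker = $1
            # The lock ID is everything between the locker (first field)
            # and the address (last field), e.g. "auto 94464726847232".
            id = $2
            for (i = 3; i < NF; i++) id = id " " $i
            printf "rbd lock remove %s \"%s\" %s\n", img, id, locker
        }'
}

# Usage (prints the command instead of executing it):
#   rbd lock list "$IMAGE" | build_unlock_cmd "$IMAGE"
```

With the lock shown above, this would print an "rbd lock remove" command with 
the lock ID "auto 94464726847232" and the locker client.1164351.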

We know that disabling the exclusive-lock feature in Ceph (by setting 
rbd_default_features = 3) would solve this problem, but we don’t think 
it’s the best solution for HA. Could you please share how you handle this 
and what the best practice is for this feature?
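
To make clear what rbd_default_features = 3 means: the value is a bitmask of 
RBD image features. The sketch below decodes such a value using the standard 
feature bits (layering = 1, striping = 2, exclusive-lock = 4, object-map = 8, 
fast-diff = 16, deep-flatten = 32, journaling = 64). The function name 
decode_rbd_features is hypothetical and only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decode an rbd_default_features bitmask into the names
# of the RBD image features it enables.
decode_rbd_features() {
    value="$1"
    out=""
    for pair in 1:layering 2:striping 4:exclusive-lock 8:object-map \
                16:fast-diff 32:deep-flatten 64:journaling; do
        bit=${pair%%:*}     # numeric bit value before the colon
        name=${pair#*:}     # feature name after the colon
        if [ $(( value & bit )) -ne 0 ]; then
            # Append the name, separated by a space if out is non-empty.
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_rbd_features 3    # the value we tried
```

A value of 3 enables only layering and striping, so besides exclusive-lock it 
also leaves out object-map and fast-diff, which both depend on exclusive-lock.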

Thanks.
