[
https://issues.apache.org/jira/browse/CLOUDSTACK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730494#comment-13730494
]
Lennert den Teuling commented on CLOUDSTACK-3954:
-------------------------------------------------
Hi Koushik,
It's true that it only happens when the agent crashes or you kill it. If you
shut it down properly, HA is not triggered because a different disconnect
status is sent to the manager. This is correct, because you should be able to
restart the agent, for example, without triggering HA.
I think the main concern here is that, besides the agent, there is no second
method to verify whether a VM is running. So when the agent crashes, the
heartbeat for the fencer will also stop working because it is the same process.
I don't see the VM ping check as a valid second way to determine whether a VM
is running, because there could be numerous reasons why a VM is not pingable
(SG, firewall on the VM, etc.).
Currently the agent is quite stable, but there could be a lot of (future)
reasons why it could crash. If it crashes, the split-brain will occur.
So I think we should replace the VM ping check with something more reliable.
For example, my colleague is currently working on a simple web service which
runs as a separate process on the host and can tell the management server
whether VMs are running on a host even when the agent is not running. This way
you always have two ways of knowing whether a VM is alive, which is far more
reliable than a ping.
Do you see this as a solution?
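To illustrate the idea, here is a minimal Python sketch of such a side service
and of the check the management side could make. It is not the service that is
actually being developed: the /vms endpoint, make_handler,
should_restart_elsewhere and the list_running_vms callable are all hypothetical
names, and treating an unreachable host as "VM not confirmed" is an assumption
that a real setup would still have to combine with fencing.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


def make_handler(list_running_vms):
    """Build a handler around a callable returning the names of running VMs.

    How the callable finds the VMs (libvirt, process list, ...) is left open;
    the point is that it lives in a process separate from the agent.
    """

    class VmStatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/vms":  # hypothetical endpoint
                self.send_error(404)
                return
            body = json.dumps({"running": sorted(list_running_vms())}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return VmStatusHandler


def should_restart_elsewhere(vm, agent_alive, host_url, timeout=2.0):
    """Return True only if neither the agent nor the side service vouches for the VM."""
    if agent_alive:
        return False  # first source of truth: the agent is connected
    try:
        with urlopen(host_url + "/vms", timeout=timeout) as resp:
            running = json.load(resp)["running"]
    except OSError:
        # Assumption: side service unreachable means the VM cannot be confirmed.
        return True
    return vm not in running  # second source of truth: the side service
```

On a host this could be started with something like
HTTPServer(("0.0.0.0", port), make_handler(my_lister)).serve_forever(), where
my_lister and port are placeholders. With this in place, a crashed agent alone
would no longer be enough to restart a VM that is still running.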
> HA with Security Groups and ping disabled will cause split-brain
> ----------------------------------------------------------------
>
> Key: CLOUDSTACK-3954
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3954
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: KVM
> Affects Versions: 4.1.0
> Environment: Tested this with CS 4.1 on Ubuntu, but will probably
> exist in other versions
> Reporter: Lennert den Teuling
> Priority: Critical
> Fix For: 4.2.0
>
>
> We found out that running CS 4.1 on KVM with Security Groups enabled and
> ping disabled (the default) will cause a split-brain when the agent crashes.
> How to reproduce:
> 1. Set up a Basic Zone with SG enabled
> 2. Create one or multiple HA-enabled VMs with a security group which does
> not allow ping (by default).
> 3. Kill the agent on one of the hosts
> When you do this, the HA component on the management server will restart all
> VMs on another node, even when they are running and the VM host is still
> pingable. This will likely corrupt all VMs on the host where the agent was
> stopped/killed.
> We had some issues with libvirt causing the agent to disconnect. Luckily some
> VMs allowed ping so nothing bad happened.
> Temporary fix:
> Ensure at least one of the running VMs on each host allows ping, so the HA
> manager will be able to ping it and will not HA the host.
> I'm not sure yet why this happens, but wanted to file this bug so people can
> take the necessary precautions.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira