[jira] [Commented] (CLOUDSTACK-3954) HA with Security Groups and ping disabled will cause split-brain

Lennert den Teuling (JIRA) Mon, 05 Aug 2013 04:51:58 -0700

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729422#comment-13729422 ]


Lennert den Teuling commented on CLOUDSTACK-3954:
-------------------------------------------------

Hi Koushik,

Of course: http://pastebin.com/8cXCagGT

The difference between the two issues is that with CLOUDSTACK-3535, when you 
pull the plug, the UserVmDomRInvestigator returns NULL because it cannot reach 
the agent _and_ the hypervisor can't be pinged. In that case, the code says 
nothing should happen (I don't agree, but that's not the issue here). 

With this issue, the host is pingable, so the HA process continues. The HA 
manager conducts a VM ping, which fails, and then a filesystem health check, 
which also fails because the agent is not running. After that, the process 
asks the KVMFencer to fence the VM, and the fencer reports success.
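
The sequence described above can be sketched roughly as follows. This is a 
simplified model based only on the description in this comment, not 
CloudStack's actual code. All names (HaAction, decide_ha_action and its 
parameters) are hypothetical, and the rule that a passing VM ping or 
filesystem check leaves the VM alone is an assumption.

```python
from enum import Enum


class HaAction(Enum):
    DO_NOTHING = "do nothing"
    LEAVE_RUNNING = "leave VM where it is"
    RESTART_ELSEWHERE = "restart VM on another host"


def decide_ha_action(host_pingable: bool,
                     vm_pingable: bool,
                     fs_check_ok: bool,
                     fence_ok: bool) -> HaAction:
    """Model the HA decision for a VM whose host agent is disconnected.

    Hypothetical sketch; the real HA manager has more steps than this.
    """
    if not host_pingable:
        # CLOUDSTACK-3535 case: agent and host both unreachable, the
        # investigator returns NULL and nothing happens.
        return HaAction.DO_NOTHING
    if vm_pingable or fs_check_ok:
        # Assumption: any positive sign from the VM stops the HA process.
        return HaAction.LEAVE_RUNNING
    # Both checks failed. With the agent down the filesystem check always
    # fails, and with ping blocked by the security group the VM ping always
    # fails, so the fencer is the last thing standing before a restart.
    if fence_ok:
        return HaAction.RESTART_ELSEWHERE
    return HaAction.DO_NOTHING
```

With the values from this issue (host pingable, VM ping blocked, agent down, 
fencer returning true), the sketch ends in a restart on another host even 
though the VM may still be running.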

Why the fencer states the VM has been fenced successfully is not entirely 
clear to me. The fencer may be part of the problem as well, because it is the 
last thing that can prevent the restart. 

2013-08-05 11:43:54,408 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:work-104) Fencer KVMFenceBuilder returned true

I think you can come up with several reasons why a VM is not pingable, so this 
cannot be the only thing that keeps the HA manager from restarting the VM on 
another host. It may be necessary to drop the idea of pinging VMs altogether, 
because it is far from reliable. 

In my opinion, if the hypervisor shows any sign of life, you cannot restart the 
VM on another host, because you can never be sure the VM is not still running. 
In this state, the host needs to be turned off (through IPMI, for example) 
before the HA process continues. 
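
The rule proposed above can be sketched like this. It is only an illustration 
of the idea, not existing CloudStack behaviour. The names restart_allowed and 
power_off_host are hypothetical, where power_off_host stands for any 
out-of-band power-off (IPMI, for example) that returns True once the host is 
confirmed off. Refusing the restart for a completely silent host is an 
assumption made here to keep the sketch conservative; that case belongs to 
CLOUDSTACK-3535.

```python
from typing import Callable


def restart_allowed(host_shows_life: bool,
                    power_off_host: Callable[[], bool]) -> bool:
    """Return True only if restarting the VM elsewhere is considered safe.

    Hypothetical sketch of the proposal in this comment.
    """
    if not host_shows_life:
        # Assumption: a silent host is out of scope here, so do not restart.
        return False
    # The host is alive in some way, so the VM may still be running on it.
    # Only a confirmed power-off of the host makes a restart safe.
    return power_off_host()
```

For example, restart_allowed(True, lambda: False) returns False, so the HA 
process would wait instead of starting a second copy of the VM.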

                
> HA with Security Groups and ping disabled will cause split-brain
> ----------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3954
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3954
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: KVM
>    Affects Versions: 4.1.0
>         Environment: Tested this with CS 4.1 on Ubuntu, but will probably 
> exist in other versions
>            Reporter: Lennert den Teuling
>            Assignee: Koushik Das
>            Priority: Critical
>             Fix For: 4.2.0
>
>
> We found out that running CS 4.1 on KVM with Security Groups enabled + 
> ping disabled (default) will cause a split-brain when the agent crashes. 
> How to reproduce:
> 1. Set up a Basic Zone with SG enabled
> 2. Create one or more HA-enabled VMs with a security group which does 
> not allow ping (the default). 
> 3. Kill the agent on one of the hosts
> When you do this, the HA component on the management server will restart all 
> VMs on another node, even when they are running and the VM host is still 
> pingable. This will likely corrupt all VMs on the host where the agent was 
> stopped/killed. 
> We had some issues with libvirt causing the agent to disconnect. Luckily, 
> some VMs allowed ping, so nothing bad happened. 
> Temporary fix:
> Ensure at least one of the running VMs on each host allows ping, so the HA 
> manager will be able to ping it and will not HA the host. 
> I'm not sure yet why this happens, but wanted to file this bug so people can 
> take the necessary precautions. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
