Could some of you please review hotfix/4.4/CLOUDSTACK-7184? I could also do with some advice on how to test this functionality thoroughly.
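
To make the testing question concrete, here is a rough Python sketch of the timing rule described in the ticket: HA may only start once the host has been down for at least xen.heartbeat.interval seconds. The function name may_start_ha and the 60-second interval are made up for illustration; this is not the code in the commit, only the behaviour a test would want to pin down, checked against the timestamps from the quoted logs.

```python
from datetime import datetime, timedelta

LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # timestamp layout used in the quoted logs


def may_start_ha(marked_down_at, now, heartbeat_interval_s):
    """Hypothetical guard: allow HA only after the host has been down
    for at least heartbeat_interval_s seconds."""
    return now - marked_down_at >= timedelta(seconds=heartbeat_interval_s)


# Timestamps taken from the ticket's log excerpt.
marked_down = datetime.strptime("2014-07-25 05:03:31,920", LOG_FORMAT)
marked_up = datetime.strptime("2014-07-25 05:03:49,655", LOG_FORMAT)

ASSUMED_INTERVAL_S = 60  # assumption; the ticket does not state the configured value

outage = marked_up - marked_down
print("host was marked down for", outage.total_seconds(), "seconds")
print("HA allowed when host returned:",
      may_start_ha(marked_down, marked_up, ASSUMED_INTERVAL_S))
```

With any interval longer than the outage in the logs, the guard refuses to start HA at the moment the host came back, which is the case that went wrong in the report.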

thanks,
Daan

On Tue, Sep 9, 2014 at 10:22 PM, ASF subversion and git services (JIRA) <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127488#comment-14127488 ]
>
> ASF subversion and git services commented on CLOUDSTACK-7184:
> -------------------------------------------------------------
>
> Commit 7f97bf7c58a348e684f7138c93c287a3c86ca55b in cloudstack's branch
> refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
> [ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=7f97bf7 ]
>
> CLOUDSTACK-7184: xenheartbeat gets passed timeout and interval
>
> > HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
> > ------------------------------------------------------------------------------------------------------------
> >
> >                 Key: CLOUDSTACK-7184
> >                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> >             Project: CloudStack
> >          Issue Type: Bug
> >      Security Level: Public(Anyone can view this level - this is the default.)
> >          Components: Hypervisor Controller, Management Server, XenServer
> >    Affects Versions: 4.3.0, 4.4.0, 4.5.0
> >         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
> >            Reporter: Remi Bergsma
> >            Assignee: Daan Hoogland
> >            Priority: Blocker
> >
> > Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did discover this and marked the host as down, and immediately started HA. Just 18 seconds later the hypervisor returned and we ended up with 5 vm's that were running on two hypervisors at the same time.
> > This, of course, resulted in file system corruption and the loss of the vm's. One side of the story is why XenServer allowed this to happen (will not bother you with this one). The CloudStack side of the story: HA should only start after at least xen.heartbeat.interval seconds. If the host is down long enough, the Xen heartbeat script will fence the hypervisor and prevent corruption. If it is not down long enough, nothing should happen.
> > Logs (short):
> > 2014-07-25 05:03:28,596 WARN [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> > .....
> > 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX. Starting HA on the VMs
> > .....
> > 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> > cs marks host down: 2014-07-25 05:03:31,920
> > cs marks host up: 2014-07-25 05:03:49,655
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

--
Daan