Re: [jira] [Commented] (CLOUDSTACK-7184) HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down

Daan Hoogland Tue, 09 Sep 2014 13:44:12 -0700

Could some of you please review hotfix/4.4/CLOUDSTACK-7184?

I could do with some advice on how to thoroughly test this functionality as
well.


Thanks,
Daan


On Tue, Sep 9, 2014 at 10:22 PM, ASF subversion and git services (JIRA) <
[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127488#comment-14127488
> ]
>
> ASF subversion and git services commented on CLOUDSTACK-7184:
> -------------------------------------------------------------
>
> Commit 7f97bf7c58a348e684f7138c93c287a3c86ca55b in cloudstack's branch
> refs/heads/hotfix/4.4/CLOUDSTACK-7184 from [~dahn]
> [ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=7f97bf7 ]
>
> CLOUDSTACK-7184: xenheartbeat gets passed timeout and interval
>
> > HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
> > ------------------------------------------------------------------------------------------------------------
> >
> >                 Key: CLOUDSTACK-7184
> >                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
> >             Project: CloudStack
> >          Issue Type: Bug
> >      Security Level: Public(Anyone can view this level - this is the default.)
> >          Components: Hypervisor Controller, Management Server, XenServer
> >    Affects Versions: 4.3.0, 4.4.0, 4.5.0
> >         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
> >            Reporter: Remi Bergsma
> >            Assignee: Daan Hoogland
> >            Priority: Blocker
> >
> > Hypervisor got isolated for 30 seconds due to a network issue.
> > CloudStack did discover this and marked the host as down, and immediately
> > started HA. Just 18 seconds later the hypervisor returned and we ended up
> > with 5 vm's that were running on two hypervisors at the same time.
> > This, of course, resulted in file system corruption and the loss of the
> > vm's. One side of the story is why XenServer allowed this to happen (will
> > not bother you with this one). The CloudStack side of the story: HA should
> > only start after at least xen.heartbeat.interval seconds. If the host is
> > down long enough, the Xen heartbeat script will fence the hypervisor and
> > prevent corruption. If it is not down long enough, nothing should happen.
> > Logs (short):
> > 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> > .....
> > 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on the VMs
> > .....
> > 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> > cs marks host down: 2014-07-25  05:03:31,920
> > cs marks host up:     2014-07-25  05:03:49,655
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
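
For reviewers, the rule the issue above asks for (start HA only once the host
has been unreachable for at least xen.heartbeat.interval seconds) could be
sketched roughly as below. This is an illustration only, not the code on the
hotfix branch: the class and method names are made up, the 60-second interval
is an assumed value for xen.heartbeat.interval, and the timestamps are the two
from the quoted log, treated as UTC.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper, not CloudStack code: decides whether HA may start for a host. */
final class HaGrace {

    /**
     * Returns true only once the host has been unreachable for at least the
     * heartbeat interval, i.e. long enough for the Xen heartbeat script to
     * have had the chance to fence the hypervisor.
     */
    static boolean mayStartHa(Instant markedDownAt, Instant now, Duration heartbeatInterval) {
        Duration downFor = Duration.between(markedDownAt, now);
        return downFor.compareTo(heartbeatInterval) >= 0;
    }

    public static void main(String[] args) {
        // Timestamps taken from the quoted log; the UTC zone is an assumption.
        Instant markedDown = Instant.parse("2014-07-25T05:03:31.920Z");
        Instant markedUp = Instant.parse("2014-07-25T05:03:49.655Z");
        // Assumed value for xen.heartbeat.interval, in seconds.
        Duration interval = Duration.ofSeconds(60);

        // The host came back after roughly 18 seconds, so HA should not have started.
        System.out.println(mayStartHa(markedDown, markedUp, interval)); // prints false
    }
}
```

With a check of this shape, the incident in the log would not have triggered
HA, because the host returned well inside the interval. A test could drive the
same method with a fake clock instead of waiting in real time.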



-- 
Daan
