Re: [Blocker][ACS41] Issues when vCenter becomes unavailable

Chip Childers Wed, 27 Feb 2013 12:47:19 -0800

CC'ing Kelven on this so that he can perhaps provide an opinion.


On Tue, Feb 26, 2013 at 06:23:32PM +0000, Musayev, Ilya wrote:
> Dear CS Dev Community,
> 
> Please confirm whether this issue qualifies as a blocker, and what can be done 
> about it.
> 
> Thanks
> ilya
> 
> From: Musayev, Ilya
> Sent: Tuesday, February 26, 2013 12:00 PM
> To: Musayev, Ilya; kelven.y...@citrix.com; 
> cloudstack-dev@incubator.apache.org; cloudstack-us...@incubator.apache.org
> Subject: RE: Issues when vCenter becomes unavailable
> 
> FYI, please note this JIRA issue. If there is something I left out, please 
> chime in.
> 
> Thanks
> ilya
> 
> https://issues.apache.org/jira/browse/CLOUDSTACK-1411
> 
> 
> 
> From: Musayev, Ilya
> Sent: Saturday, February 23, 2013 6:22 PM
> To: kelven.y...@citrix.com; cloudstack-dev@incubator.apache.org; 
> cloudstack-us...@incubator.apache.org
> Subject: Re: Issues when vCenter becomes unavailable
> 
> Any chance of some sort of fix for 4.0 or 4.1?
> 
> I understand that the CS-669 (feature/enhancement) patch missed the commit 
> deadline and will be in 4.2, but there is a real issue here that impacts 
> production now.
> 
> Also, this is not a feature but a bug. I don't know if bugs are treated on 
> the same schedule as features.
> 
> Technically, for testing - we don't need to fail hypervisors. vMotion would 
> achieve the same effect and host ID will get out of sync. It's only a theory 
> though.
> 
> I will open a bug request on JIRA and ask for some visibility.
> 
> Alternatively, we can probably have a hack that will query VC for hosts and 
> vms, identify what's changed, and update db - I'm just trying to avoid hacks.
> 
> Kelven Yang <kelven.y...@citrix.com> wrote:
> This is an issue that we are aiming to solve by syncing state between
> vCenter and CloudStack in a controllable way. Please track the status of this
> ticket for further progress:
> 
> https://issues.apache.org/jira/browse/CLOUDSTACK-669
> 
> 
> Kelven
> 
> 
> On 2/22/13 3:51 PM, "Musayev, Ilya" <imusa...@webmd.net> wrote:
> 
> >A bit of an incomplete email, as I was on the train and mistakenly pressed
> >send. Correction below, sorry :)
> >
> >-----Original Message-----
> >From: Musayev, Ilya [mailto:imusa...@webmd.net]
> >Sent: Friday, February 22, 2013 6:49 PM
> >To: cloudstack-dev@incubator.apache.org;
> >cloudstack-us...@incubator.apache.org
> >Cc: Kelven Yang
> >Subject: RE: Issues when vCenter becomes unavailable
> >
> >Summary:
> >
> >I have 3 hypervisors.
> >Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on
> >hypervisor 3; however, the host_id in the instance table for the VMs is not
> >being updated to reflect the only hypervisor alive.
> >
> >Details:
> >
> >I physically powered off 2 hypervisors that had most of my VMs and left 1
> >online.
> >
> >The VMs were brought back online by vCenter; however, from then on, I
> >experienced what Dave and Andreas mentioned.
> >
> >That is, VMware VM instances are bound to a host ID (hypervisor) and not to
> >vCenter, and operations executed on the VMs require that
> >hypervisor to stay up. If the hypervisor goes offline, while the VMs still
> >come up in VC, CS cannot comprehend that these VMs now live on another
> >hypervisor.
> >
> >This is bad for production rollouts, because VMs are bound to a
> >hypervisor ID and not to virtual center, and it appears it's not getting
> >updated, though I do see in the log that CS is trying to find it.
> >
> >I did a little more digging; it looks like the host_ids don't get updated
> >in MySQL for the VMs in the instances table. I need to double-check this
> >because I totally messed up 2 of my test CloudStack clusters.
> >
> >Can someone do the following test, if time allows? If not, I can try
> >on Monday:
> >
> >1) Pick a hypervisor for a test crash and note 1 VM (e.g. i-2-89).
> >2) Navigate to the "host" table in MySQL and note the host_id for the
> >hypervisor that is about to be powered off.
> >3) In MySQL, go to the instances table and note the last_host_id and host_id
> >for a VM on the test crash hypervisor.
> >4) Power off the hypervisor and let vCenter bring the VM back online.
> >5) Attempt to launch a console on the VM that was on the crashed hypervisor
> >and was powered back on by VC.
> >6) If it fails, as it did in my case, alter the value of host_id to that of
> >the hypervisor the VM is now living on (my test is not clean because I've
> >ruined the cluster that hosts my console VM and don't have time to work on
> >it ATM).
> >7) Launch the console again to see if the issue is resolved.
> >
> >I suspect the host_id does not get updated, as I witnessed by
> >examining the MySQL instance table, but I need to fix my env issues to
> >confirm.
> >
> >Regards
> >ilya
> >
> >
> >-----Original Message-----
> >From: Chiradeep Vittal [mailto:chiradeep.vit...@citrix.com]
> >Sent: Friday, February 22, 2013 3:41 PM
> >To: cloudstack-us...@incubator.apache.org
> >Cc: Kelven Yang; CloudStack DeveloperList
> >Subject: Re: Issues when vCenter becomes unavailable
> >
> >CC'ing Kelven to see if he has any ideas.
> >
> >On 2/22/13 12:22 PM, "Dave Dunaway" <dave.duna...@gmail.com> wrote:
> >
> >>If I may suggest also testing a disconnect of a host (hypervisor) from
> >>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
> >>to the hosts (hypervisors). CS marks the host as down or failed or
> >>whatever.
> >>
> >>When the host comes back up, vCenter can talk to it just fine and all seems
> >>good. That, however, is not the case when CS tries to talk to vCenter and
> >>the previously disconnected host, which has now recovered (I had this with
> >>CS 3.0.5 and VMware ESXi 5.0).
> >>
> >>What we experienced was that we had to migrate all guests off the
> >>recovered host, and then destroy that host in CS, and re-create it.
> >>Then we could migrate back onto it the guests which had been previously
> >>migrated.
> >>
> >>The curious thing is that while CS did not want to send commands to the
> >>host (it kept on saying host id=X has timed out when whatever command
> >>was sent to it), CS WAS polling the host for resources and getting the
> >>correct numbers... so CS could in some ways talk to the host (i.e. it
> >>knew the capabilities, number of VMs on it, etc).
> >>
> >>Luckily for me this all happened in a test environment. In production,
> >>this would have been a real nightmare!
> >>
> >>
> >>dave
> >>
> >>
> >>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <imusa...@webmd.net>
> >>wrote:
> >>
> >>> Andi
> >>>
> >>> I'm on CS 4.0. I simulated a VMware vCenter 5 failure by adding a
> >>> bogus IP entry in /etc/hosts for the virtual center host for 10 minutes.
> >>> That in turn made VC unreachable by CS.
> >>>
> >>> I then began executing commands and sure enough commands failed or
> >>> backlogged. Once I restored VC connectivity, the backlogged commands
> >>> executed and I did not experience any abnormalities.
> >>>
> >>> I will redo this test and leave VC off for an hour - maybe I need a
> >>> longer outage.
> >>>
> >>> Regards
> >>> ilya
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Musayev, Ilya
> >>> Sent: Thursday, February 21, 2013 2:43 PM
> >>> To: cloudstack-us...@incubator.apache.org
> >>> Subject: RE: Issues when vCenter becomes unavailable
> >>>
> >>> This is definitely not the behavior we want with vcenter.
> >>>
> >>> I will test this out on my lab setup shortly.
> >>>
> >>> Thanks
> >>> ilya
> >>>
> >>> -----Original Message-----
> >>> From: Chip Childers [mailto:chip.child...@sungard.com]
> >>> Sent: Thursday, February 21, 2013 9:40 AM
> >>> To: cloudstack-us...@incubator.apache.org
> >>> Subject: Re: Issues when vCenter becomes unavailable
> >>>
> >>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
> >>> > Andreas,
> >>> >
> >>> > The open source community doesn't support the Citrix version 3.0.6.
> >>> > You need to report this via your Citrix Support contract. Sounds
> >>> > like this could be a bug.
> >>> >
> >>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
> >>> > don't know if this test case has been explored.
> >>>
> >>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
> >>> community to take a look.
> >>>
> >>> >
> >>> > Thanks,
> >>> > Matt Mullins
> >>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
> >>> > Citrix System, Inc.
> >>> > +1 (407) 920-1107  Office/Cell Phone
> >>> > matt.mull...@citrix.com
> >>> >
> >>> >
> >>> >
> >>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
> >>> > <andreas.fu...@swisstxt.ch> wrote:
> >>> >
> >>> > >Hi CS Users
> >>> > >
> >>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
> >>> > >behavior.
> >>> > >
> >>> > >When the vCenter becomes unavailable due to a reboot or some other
> >>> > >issue, it seems that CS is shutting down instances when vCenter
> >>> > >becomes available again.
> >>> > >
> >>> > >What we think happens:
> >>> > >1. vCenter becomes unreachable
> >>> > >2. CS marks the ESX servers as "down"
> >>> > >3. We think this leads to: CS marks the instances as down as well
> >>> > >4. When vCenter becomes available again, CS stops the "marked as down"
> >>> > >instances
> >>> > >
> >>> > >This is very bad, as the instances were running all the time and
> >>> > >the shutdown issued by CS forces a service interruption.
> >>> > >
> >>> > >My problem is that I cannot really reproduce it, as a lot of testing
> >>> > >is ongoing on the platform at the moment, so my questions:
> >>> > >
> >>> > >Does someone else see this issue as well and can maybe reproduce it?
> >>> > >Is there a workaround for it? Can I change some flag or something
> >>> > >which tells CS to never shut down an instance by itself?
> >>> > >Why are the ESX hosts getting marked as down and not unreachable
> >>> > >or something?
> >>> > >
> >>> > >Best regards
> >>> > >Andi
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >
> >
> >
> >
> >

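As a rough illustration of the workaround Ilya mentions in the thread above (query VC for hosts and VMs, identify what has changed, and update the DB), the comparison step might be sketched in Python as follows. This is only a sketch under assumptions: the function name, the dictionary shapes, and the sample host IDs are hypothetical, and nothing here calls a real CloudStack or vSphere API. How the two mappings are obtained, and how the DB is then updated, is left open.

```python
from typing import Dict, Optional, Tuple


def find_stale_host_ids(
    db_hosts: Dict[str, Optional[int]],
    vcenter_hosts: Dict[str, int],
) -> Dict[str, Tuple[Optional[int], int]]:
    """Compare the host recorded for each VM with the host vCenter reports.

    db_hosts:      VM name -> host_id as recorded in the CloudStack DB
                   (hypothetical shape; None if no host is recorded).
    vcenter_hosts: VM name -> host_id of the hypervisor the VM actually
                   runs on according to vCenter (hypothetical shape).

    Returns VM name -> (recorded_host_id, actual_host_id) for every VM
    whose recorded host differs from the actual one.
    """
    stale: Dict[str, Tuple[Optional[int], int]] = {}
    for vm_name, actual in vcenter_hosts.items():
        if vm_name not in db_hosts:
            # VM is not known to CloudStack; this sketch ignores it.
            continue
        recorded = db_hosts[vm_name]
        if recorded != actual:
            stale[vm_name] = (recorded, actual)
    return stale


# Made-up data mirroring the scenario in the thread: hypervisors 1 and 2
# are down, and vCenter has restarted every VM on hypervisor 3.
db_hosts = {"i-2-89": 1, "i-2-90": 2, "i-2-91": 3}
vcenter_hosts = {"i-2-89": 3, "i-2-90": 3, "i-2-91": 3}

stale = find_stale_host_ids(db_hosts, vcenter_hosts)
for vm_name, (recorded, actual) in sorted(stale.items()):
    print(f"{vm_name}: recorded host_id={recorded}, actual host_id={actual}")
```

Keeping the comparison separate from any DB write means the list of mismatches can be reviewed before anything is changed, which fits the "controllable" sync Kelven describes.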