FYI, please note this JIRA issue. If there is something I left out, please chime in.
Thanks
ilya

https://issues.apache.org/jira/browse/CLOUDSTACK-1411

From: Musayev, Ilya
Sent: Saturday, February 23, 2013 6:22 PM
To: kelven.y...@citrix.com; cloudstack-...@incubator.apache.org; cloudstack-users@incubator.apache.org
Subject: Re: Issues when vCenter becomes unavailable

Any chance of some sort of fix for 4.0 or 4.1?

I understand that the CS-669 (feature/enhancement) patch missed the commit deadline and will be in 4.2, but there is a real issue here that impacts production now. Also, this is not a feature but a bug; I don't know if bugs are treated on the same schedule as features.

Technically, for testing, we don't need to fail hypervisors. vMotion would achieve the same effect and the host ID will get out of sync. It's only a theory though.

I will open a bug request on JIRA and ask for some visibility.

Alternatively, we can probably have a hack that will query VC for hosts and VMs, identify what's changed, and update the DB. I'm just trying to avoid hacks.

Kelven Yang <kelven.y...@citrix.com> wrote:

This is an issue that we are targeting to solve, to sync states between vCenter and CloudStack in a controllable way. Please track the status of this ticket for further progress:
https://issues.apache.org/jira/browse/CLOUDSTACK-669

Kelven

On 2/22/13 3:51 PM, "Musayev, Ilya" <imusa...@webmd.net> wrote:

>A bit of an incomplete email, as I was on the train and mistakenly pressed send;
>correction below. Sorry :)
>
>-----Original Message-----
>From: Musayev, Ilya [mailto:imusa...@webmd.net]
>Sent: Friday, February 22, 2013 6:49 PM
>To: cloudstack-...@incubator.apache.org; cloudstack-users@incubator.apache.org
>Cc: Kelven Yang
>Subject: RE: Issues when vCenter becomes unavailable
>
>Summary:
>
>I have 3 hypervisors.
>Hypervisors 1 and 2 are down, hypervisor 3 is up.
>All VMs live on hypervisor 3; however, the host_id in the instance table
>for the VMs is not being updated to reflect the only hypervisor alive.
>
>Details:
>
>I physically powered off 2 hypervisors that had most of my VMs and left 1
>online.
>
>The VMs were brought back online by vCenter; however, from then on, I
>experienced what Dave and Andreas mentioned.
>
>That is, VMware VM instances are bound to a host ID (hypervisor) and not
>to vCenter, and operations executed on the VMs require the hypervisor to
>stay up. If the hypervisor goes offline, while VMs still come up in VC,
>CS cannot comprehend that these VMs now live on another hypervisor.
>
>This is bad for production rollouts, because VMs are bound to a
>hypervisor ID and not to vCenter, and it appears it's not getting
>updated, though I do see in the log that CS is trying to find it.
>
>Did a little more digging; it looks like the host_ids don't get updated
>in MySQL for VMs in the instances table. I need to double check on this
>because I totally messed up 2 of my test CloudStack clusters.
>
>Can someone do the following test, if time allows? If not, I can try
>on Monday:
>
>1) Pick a hypervisor for a test crash and note 1 VM (e.g. i-2-89)
>2) Navigate to the "host" table in MySQL and note the host_id for the
>hypervisor that is about to be powered off.
>3) In MySQL, go to the instances table and note the last_host_id and
>host_id for a VM on the test crash hypervisor.
>4) Power off the hypervisor and let vCenter bring the VM back online
>5) Attempt to launch a console on the VM that was on the crashed
>hypervisor and was powered back on by VC
>6) If it fails, as it did in my case, alter the value of host_id to the
>hypervisor it now lives on (my test is not clean because I've ruined
>the cluster that hosts my console VM and don't have time to work on
>it ATM)
>7) Launch the console again to see if the issue is resolved
>
>I suspect the host_id does not get updated, as I witnessed by
>examining the MySQL instance table, but I need to fix my env issues to
>confirm.
>
>Regards
>ilya
>
>
>-----Original Message-----
>From: Chiradeep Vittal [mailto:chiradeep.vit...@citrix.com]
>Sent: Friday, February 22, 2013 3:41 PM
>To: cloudstack-users@incubator.apache.org
>Cc: Kelven Yang; CloudStack DeveloperList
>Subject: Re: Issues when vCenter becomes unavailable
>
>CC'ing Kelven to see if he has any ideas.
>
>On 2/22/13 12:22 PM, "Dave Dunaway" <dave.duna...@gmail.com> wrote:
>
>>If I may suggest also testing a disconnect of a host (hypervisor) from
>>vCenter, so that vCenter and CS can still talk, but vCenter cannot talk
>>to the hosts (hypervisors). CS marks the host as down or failed or
>>whatever.
>>
>>When the host comes back up, vCenter can see it just fine and all seems
>>good. That, however, is not the case (I had this with CS 3.0.5 and
>>VMware ESXi 5.0) when CS tries to talk to vCenter and the previously
>>disconnected host (that is now recovered).
>>
>>What we experienced was that we had to migrate all guests off the
>>recovered host, and then destroy that host in CS, and re-create it.
>>Then we could migrate back onto it the guests which had been previously
>>migrated.
>>
>>The curious thing is that while CS did not want to send commands to the
>>host (it kept on saying host id=X has timed out when whatever command
>>was sent to it), CS WAS polling the host for resources and getting the
>>correct numbers... so CS could in some ways talk to the host (i.e. it
>>knew the capabilities, number of VMs on it, etc.).
>>
>>Luckily for me this all happened in a test environment. In production,
>>this would have been a real nightmare!
>>
>>
>>dave
>>
>>
>>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <imusa...@webmd.net>
>>wrote:
>>
>>> Andi
>>>
>>> I'm on CS 4.0. I simulated the VMware vCenter 5 failure by adding a
>>> bogus IP entry in /etc/hosts for 10 minutes for the virtual center
>>> host. That in turn made VC unreachable by CS.
>>>
>>> I then began executing commands and sure enough commands failed or
>>> backlogged. Once I restored VC connectivity, the backlogged commands
>>> executed and I did not experience any abnormalities.
>>>
>>> I will redo this test and leave VC off for an hour - maybe I need a
>>> longer outage.
>>>
>>> Regards
>>> ilya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Musayev, Ilya
>>> Sent: Thursday, February 21, 2013 2:43 PM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: RE: Issues when vCenter becomes unavailable
>>>
>>> This is definitely not the behavior we want with vCenter.
>>>
>>> I will test this out on my lab setup shortly.
>>>
>>> Thanks
>>> ilya
>>>
>>> -----Original Message-----
>>> From: Chip Childers [mailto:chip.child...@sungard.com]
>>> Sent: Thursday, February 21, 2013 9:40 AM
>>> To: cloudstack-users@incubator.apache.org
>>> Subject: Re: Issues when vCenter becomes unavailable
>>>
>>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
>>> > Andreas,
>>> >
>>> > The open source community doesn't support the Citrix version 3.0.6.
>>> > You need to report this via your Citrix Support contract. Sounds
>>> > like this could be a bug.
>>> >
>>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
>>> > don't know if this test case has been explored.
>>>
>>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
>>> community to take a look.
>>>
>>> >
>>> > Thanks,
>>> > Matt Mullins
>>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
>>> > Citrix System, Inc.
>>> > +1 (407) 920-1107 Office/Cell Phone
>>> > matt.mull...@citrix.com
>>> >
>>> >
>>> >
>>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
>>> > <andreas.fu...@swisstxt.ch> wrote:
>>> >
>>> > >Hi CS Users
>>> > >
>>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
>>> > >behavior.
>>> > >
>>> > >When the vCenter becomes unavailable due to a reboot or some other
>>> > >issue, it seems that CS is shutting down instances when vCenter
>>> > >becomes available again.
>>> > >
>>> > >What we think happens:
>>> > >1. vCenter becomes unreachable
>>> > >2. CS marks the ESX servers as "down"
>>> > >3. We think this leads to: CS marks the instances as down as well
>>> > >4. When vCenter becomes available again, CS stops the "marked as
>>> > >down" instances
>>> > >
>>> > >This is very bad, as the instances were running all the time and
>>> > >the shutdown issued by CS is forcing a service interruption.
>>> > >
>>> > >My problem is that I cannot really reproduce it, as a lot of testing
>>> > >is ongoing on the platform at the moment, so my question:
>>> > >
>>> > >Does someone else see this issue as well and can maybe reproduce it?
>>> > >Is there a workaround for it? Can I change some flag or something
>>> > >which tells CS to never shut down an instance by itself?
>>> > >Why are the ESX hosts getting marked as down and not unreachable
>>> > >or something?
>>> > >
>>> > >Best regards
>>> > >Andi
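
For reference, the "hack" floated at the top of this thread (query VC for hosts and VMs, identify what's changed, update the DB) might start from a comparison like the sketch below. It is only an illustration under stated assumptions: find_stale_host_ids and both input mappings are hypothetical names, the placement data is made up apart from the i-2-89 instance name used in the test steps, and fetching the real data from vCenter and from the CloudStack database is not shown. It only reports mismatches and does not write anything back.

```python
# Hypothetical sketch: compare the host each VM is recorded on in the
# CloudStack DB with the host vCenter reports, and list the mismatches.
# Both mappings are assumed to be {instance_name: host_id}; how they are
# collected (DB query, vCenter API) is deliberately left out.

def find_stale_host_ids(db_placement, vc_placement):
    """Return {vm: (db_host_id, vc_host_id)} for VMs whose DB host is stale."""
    stale = {}
    for vm, vc_host in vc_placement.items():
        db_host = db_placement.get(vm)
        # Skip VMs the DB does not know about; flag only real disagreements.
        if db_host is not None and db_host != vc_host:
            stale[vm] = (db_host, vc_host)
    return stale


# Made-up example: hypervisors 1 and 2 are down, everything runs on 3.
db_placement = {"i-2-89": 1, "i-2-90": 2, "i-2-91": 3}
vc_placement = {"i-2-89": 3, "i-2-90": 3, "i-2-91": 3}

for vm, (old, new) in sorted(find_stale_host_ids(db_placement, vc_placement).items()):
    print(f"{vm}: DB says host {old}, vCenter says host {new}")
```

Keeping the comparison separate from any update step means the result can be reviewed before anyone touches host_id by hand, which seems in the spirit of avoiding blind hacks.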