CC'ing Kelven on this so that he can perhaps provide an opinion.

On Tue, Feb 26, 2013 at 06:23:32PM +0000, Musayev, Ilya wrote:
> Dear CS Dev Community,
>
> Please confirm whether this issue qualifies as a blocker and what can be
> done about it.
>
> Thanks
> ilya
>
> From: Musayev, Ilya
> Sent: Tuesday, February 26, 2013 12:00 PM
> To: Musayev, Ilya; kelven.y...@citrix.com;
> cloudstack-dev@incubator.apache.org; cloudstack-us...@incubator.apache.org
> Subject: RE: Issues when vCenter becomes unavailable
>
> FYI, please note this JIRA issue. If there is something I left out, please
> chime in.
>
> Thanks
> ilya
>
> https://issues.apache.org/jira/browse/CLOUDSTACK-1411
>
> From: Musayev, Ilya
> Sent: Saturday, February 23, 2013 6:22 PM
> To: kelven.y...@citrix.com; cloudstack-dev@incubator.apache.org;
> cloudstack-us...@incubator.apache.org
> Subject: Re: Issues when vCenter becomes unavailable
>
> Any chance of some sort of fix for 4.0 or 4.1?
>
> I understand that the CS-669 (feature/enhancement) patch missed the commit
> deadline and will be in 4.2, but there is a real issue here that impacts
> production now.
>
> Also, this is not a feature but a bug. I don't know if bugs are treated
> on the same schedule as features.
>
> Technically, for testing, we don't need to fail hypervisors. vMotion would
> achieve the same effect, and the host ID would get out of sync. It's only a
> theory, though.
>
> I will open a bug request on JIRA and ask for some visibility.
>
> Alternatively, we can probably have a hack that will query VC for hosts and
> VMs, identify what's changed, and update the db - I'm just trying to avoid
> hacks.
>
> Kelven Yang <kelven.y...@citrix.com> wrote:
> This is an issue that we are targeting to solve to sync states between
> vCenter/Cloudstack in a controllable way.
> Please track the status of this ticket for further progress:
>
> https://issues.apache.org/jira/browse/CLOUDSTACK-669
>
> Kelven
>
> On 2/22/13 3:51 PM, "Musayev, Ilya" <imusa...@webmd.net> wrote:
>
> >A bit of an incomplete email, as I was on the train and mistakenly pressed
> >send. Correction below, sorry :)
> >
> >-----Original Message-----
> >From: Musayev, Ilya [mailto:imusa...@webmd.net]
> >Sent: Friday, February 22, 2013 6:49 PM
> >To: cloudstack-dev@incubator.apache.org;
> >cloudstack-us...@incubator.apache.org
> >Cc: Kelven Yang
> >Subject: RE: Issues when vCenter becomes unavailable
> >
> >Summary:
> >
> >I have 3 hypervisors.
> >Hypervisors 1 and 2 are down, hypervisor 3 is up. All VMs live on
> >hypervisor 3; however, the host_id in the instance table for the VMs is
> >not being updated to reflect the only hypervisor alive.
> >
> >Details:
> >
> >I physically powered off 2 hypervisors that had most of my VMs and left 1
> >online.
> >
> >The VMs were brought back online by vCenter; however, from then on, I
> >experienced what Dave and Andreas mentioned.
> >
> >That is, VMware VM instances are bound to a host id (hypervisor) and not
> >to vCenter, and operations executed on the VMs require the hypervisor to
> >stay up. If the hypervisor goes offline, while the VMs still come up in
> >VC, CS cannot comprehend that these VMs now live on another hypervisor.
> >
> >This is bad for production rollouts, because VMs are bound to a
> >hypervisor ID and not to Virtual Center, and it appears it's not getting
> >updated - though I do see in the log that CS is trying to find it.
> >
> >Did a little more digging; it looks like the host_ids don't get updated
> >in mysql for VMs in the instances table. I need to double check on this
> >because I totally messed up 2 of my test CloudStack clusters.
> >
> >Can someone do the following test, if time allows? If not, I can try
> >on Monday:
> >
> >1) Pick a hypervisor for a test crash and note 1 VM (e.g. i-2-89)
> >2) Navigate to the "host" table in mysql and note the host_id for the
> >hypervisor that is about to be powered off.
> >3) In mysql, go to the instances table and note the last_host_id and
> >host_id for a VM on the test crash hypervisor.
> >4) Power off the hypervisor and let vCenter bring the VM back online
> >5) Attempt to launch a console on the VM that was on the crashed
> >hypervisor and was powered back on by VC
> >6) If it fails, as it did in my case, alter the value of host_id to the
> >hypervisor it's now living on (my test is not clean because I've ruined
> >the cluster that hosts my console VM and don't have time to work on
> >it ATM)
> >7) Launch the console again to see if the issue is resolved
> >
> >I suspect the host_id does not get updated, as I witnessed by
> >examining the mysql instance table, but I need to fix my env issues to
> >confirm.
> >
> >Regards
> >ilya
> >
> >-----Original Message-----
> >From: Chiradeep Vittal [mailto:chiradeep.vit...@citrix.com]
> >Sent: Friday, February 22, 2013 3:41 PM
> >To: cloudstack-us...@incubator.apache.org
> >Cc: Kelven Yang; CloudStack DeveloperList
> >Subject: Re: Issues when vCenter becomes unavailable
> >
> >CC'ing Kelven to see if he has any ideas.
> >
> >On 2/22/13 12:22 PM, "Dave Dunaway" <dave.duna...@gmail.com> wrote:
> >
> >>If I may suggest also testing a disconnect of a host (hypervisor) from
> >>vcenter, so that vcenter and CS can still talk, but vcenter cannot talk
> >>to the hosts (hypervisors). CS marks the host as down or failed or
> >>whatever.
> >>
> >>When the host comes back up, vcenter can see it just fine and all seems
> >>good.
> >>That, however, is not the case (I had this with CS 3.0.5 and VMware ESXi
> >>5.0) when CS tries to talk to vcenter and the previously disconnected
> >>host (that is now recovered).
> >>
> >>What we experienced was that we had to migrate all guests off the
> >>recovered host, and then destroy that host in CS, and re-create it.
> >>Then we could migrate back onto it the guests which had been previously
> >>migrated.
> >>
> >>The curious thing is that while CS did not want to send commands to the
> >>host (it kept on saying host id=X has timedout when whatever command
> >>was sent to it), CS WAS polling the host for resources and getting the
> >>correct numbers... so CS could in some ways talk to the host (i.e. it
> >>knew the capabilities, number of VMs on it, etc).
> >>
> >>Luckily for me this all happened in a test environment. In production,
> >>this would have been a real nightmare!
> >>
> >>dave
> >>
> >>On Fri, Feb 22, 2013 at 2:48 PM, Musayev, Ilya <imusa...@webmd.net>
> >>wrote:
> >>
> >>> Andi
> >>>
> >>> I'm on CS 4.0. I simulated the VMware vCenter 5 failure by adding a
> >>> bogus IP entry in /etc/hosts for 10 minutes for the Virtual Center
> >>> host. That in turn made VC unreachable by CS.
> >>>
> >>> I then began executing commands and sure enough commands failed or
> >>> backlogged. Once I restored VC connectivity, the backlogged commands
> >>> executed and I did not experience any abnormalities.
> >>>
> >>> I will redo this test and leave VC off for an hour - maybe I need a
> >>> longer outage.
> >>>
> >>> Regards
> >>> ilya
> >>>
> >>> -----Original Message-----
> >>> From: Musayev, Ilya
> >>> Sent: Thursday, February 21, 2013 2:43 PM
> >>> To: cloudstack-us...@incubator.apache.org
> >>> Subject: RE: Issues when vCenter becomes unavailable
> >>>
> >>> This is definitely not the behavior we want with vcenter.
> >>>
> >>> I will test this out on my lab setup shortly.
> >>>
> >>> Thanks
> >>> ilya
> >>>
> >>> -----Original Message-----
> >>> From: Chip Childers [mailto:chip.child...@sungard.com]
> >>> Sent: Thursday, February 21, 2013 9:40 AM
> >>> To: cloudstack-us...@incubator.apache.org
> >>> Subject: Re: Issues when vCenter becomes unavailable
> >>>
> >>> On Thu, Feb 21, 2013 at 08:59:14AM -0500, Mathias Mullins wrote:
> >>> > Andreas,
> >>> >
> >>> > The open source community doesn't support the Citrix version 3.0.6.
> >>> > You need to report this via your Citrix Support contract. Sounds
> >>> > like this could be a bug.
> >>> >
> >>> > Community - this could be a possible issue in 4.0.0 / 4.0.1. I
> >>> > don't know if this test case has been explored.
> >>>
> >>> Thx - I forwarded to cs-dev@i.a.o to get the test engineers in the
> >>> community to take a look.
> >>>
> >>> > Thanks,
> >>> > Matt Mullins
> >>> > CloudPlatform Implementation Engineer Worldwide Cloud Services
> >>> > Citrix System, Inc.
> >>> > +1 (407) 920-1107 Office/Cell Phone
> >>> > matt.mull...@citrix.com
> >>> >
> >>> > On 2/21/13 5:35 AM, "Fuchs, Andreas (SwissTXT)"
> >>> > <andreas.fu...@swisstxt.ch> wrote:
> >>> >
> >>> > >Hi CS Users
> >>> > >
> >>> > >We are running CS 3.0.6 on a vSphere platform and found a strange
> >>> > >behavior.
> >>> > >
> >>> > >When the vCenter becomes unavailable due to a reboot or some other
> >>> > >issue, it seems that CS is shutting down instances when vCenter
> >>> > >becomes available again.
> >>> > >
> >>> > >What we think happens:
> >>> > >1. vCenter becomes unreachable
> >>> > >2. CS marks the ESX servers as "down"
> >>> > >3. We think this leads to: CS marks the instances as down as well
> >>> > >4. When vCenter becomes available again, CS stops the "marked as
> >>> > >down" instances
> >>> > >
> >>> > >This is very bad, as the instances were running all the time and
> >>> > >the shutdown issued by CS is forcing a service interruption.
> >>> > >
> >>> > >My problem is that I cannot really reproduce this, as a lot of
> >>> > >testing is ongoing on the platform at the moment, so my question:
> >>> > >
> >>> > >Does someone else see this issue as well and can maybe reproduce it?
> >>> > >Is there a workaround, can I change some flag or something
> >>> > >which tells CS to never shut down an instance by itself?
> >>> > >Why are the ESX hosts getting marked as down and not unreachable
> >>> > >or something?
> >>> > >
> >>> > >Best regards
> >>> > >Andi
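
For reference, the "hack" Ilya mentions in the thread (query VC for hosts and VMs, identify what's changed, update the db) could be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not CloudStack code. The function name `find_stale_host_ids`, the `HostIdFix` record, the host names `esx-1`..`esx-3`, and the instance name `i-2-90` are hypothetical. It assumes the VM-to-host mapping has already been read from vCenter and the host_id values from the database by some other means, and it only reports proposed corrections; it does not touch vCenter or mysql.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HostIdFix:
    """One proposed correction for a VM whose recorded host_id is stale."""
    instance_name: str
    old_host_id: Optional[int]
    new_host_id: int


def find_stale_host_ids(
    db_vm_host_ids: Dict[str, Optional[int]],
    db_host_names: Dict[int, str],
    vcenter_vm_hosts: Dict[str, str],
) -> List[HostIdFix]:
    """Compare the host_id recorded for each VM with where vCenter says it runs.

    db_vm_host_ids:   instance name -> host_id as recorded in the database
    db_host_names:    host_id -> hypervisor name as recorded in the database
    vcenter_vm_hosts: instance name -> hypervisor name as reported by vCenter
    All three inputs are assumed to be gathered elsewhere (hypothetical).
    """
    # Invert the host table so a hypervisor name can be mapped to its id.
    host_id_by_name = {name: host_id for host_id, name in db_host_names.items()}

    fixes: List[HostIdFix] = []
    for instance_name, recorded_host_id in sorted(db_vm_host_ids.items()):
        actual_host_name = vcenter_vm_hosts.get(instance_name)
        if actual_host_name is None:
            continue  # vCenter does not report this VM; leave it alone
        actual_host_id = host_id_by_name.get(actual_host_name)
        if actual_host_id is None:
            continue  # hypervisor is unknown to the database; skip
        if actual_host_id != recorded_host_id:
            fixes.append(HostIdFix(instance_name, recorded_host_id, actual_host_id))
    return fixes


if __name__ == "__main__":
    # Scenario from the thread: hypervisors 1 and 2 are down, all VMs now
    # live on hypervisor 3, but one VM still points at host_id 1.
    hosts = {1: "esx-1", 2: "esx-2", 3: "esx-3"}
    recorded = {"i-2-89": 1, "i-2-90": 3}
    reported = {"i-2-89": "esx-3", "i-2-90": "esx-3"}
    for fix in find_stale_host_ids(recorded, hosts, reported):
        print(fix)
```

Keeping the comparison separate from any database write means the proposed changes can be reviewed before anything is altered, which seems in line with the wish in the thread to avoid blind hacks.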