I agree...and understand :) But would this mean that VMs will not be provisioned anywhere when HA kicks in? I guess so... What I want to avoid is having another copy of a VM started while it is already running on a disconnected host - I need this only as a temporary measure during CEPH backfilling, so I'm not sure whether this heavy hack is a good idea, or whether it will cause me even more trouble...
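For the record, the heavy hack I mean is roughly the following - an untested sketch against the cloud database, using the column name from my earlier mail below (depending on the version it may actually be called ha_enabled, so please check the schema before touching anything):

    -- remember which instances currently have HA set, so the flag can be restored later
    CREATE TABLE cloud.tmp_ha_backup AS SELECT id FROM cloud.vm_instance WHERE ha = 1;

    -- temporarily switch HA off for those instances
    UPDATE cloud.vm_instance SET ha = 0 WHERE ha = 1;

    -- once CEPH backfilling is done and the cluster is healthy again, restore the flag
    UPDATE cloud.vm_instance SET ha = 1 WHERE id IN (SELECT id FROM cloud.tmp_ha_backup);
    DROP TABLE cloud.tmp_ha_backup;

The restore step is the part I care about most - I don't want to forget which VMs had HA enabled once the backfilling is over.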
cheers

On 16 February 2015 at 16:58, Logan Barfield <lbarfi...@tqhosting.com> wrote:
> Hi Andrija,
>
> The way I understand it (and have seen in practice) is that by default
> the MGMT server will use any available host for HA. Setting the HA
> tag on a host just dedicates that host to HA, meaning that during
> normal provisioning no VMs will use that host; it will only be used
> for HA purposes. In other words, the "HA" tag is not required for HA
> to work.
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
>
> On Mon, Feb 16, 2015 at 10:43 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:
> > Seems to me that I'm about to issue something similar to: update
> > cloud.vm_instance set ha = 0 where ha = 1...
> >
> > Now seriously, I'm wondering, per the manual - if you define the HA host tag at
> > the global config level, and then have NO hosts with that tag - will the MGMT
> > server be unable to start VMs on other hosts, since there are no hosts
> > dedicated as HA destinations?
> >
> > Does this make sense? I guess the VMs will just be marked as Stopped in
> > the GUI/database, but it will be impossible to start them...
> > Stupid proposal, but... ?
> >
> > On 16 February 2015 at 16:22, Logan Barfield <lbarfi...@tqhosting.com> wrote:
> >
> >> Some sort of fencing independent of the management server is
> >> definitely needed. HA in general (particularly on KVM) is all kinds
> >> of unpredictable/buggy right now.
> >>
> >> I like the idea of having a switch that an admin can flip to stop HA.
> >> In fact I think a better job control system in general (e.g., being
> >> able to stop/restart/manually start tasks) would be awesome, if it's
> >> feasible.
> >>
> >> Thank You,
> >>
> >> Logan Barfield
> >> Tranquil Hosting
> >>
> >>
> >> On Mon, Feb 16, 2015 at 10:05 AM, Wido den Hollander <w...@widodh.nl> wrote:
> >> >
> >> > On 16-02-15 13:16, Andrei Mikhailovsky wrote:
> >> >> I had similar issues at least two or three times. The host agent would
> >> >> disconnect from the management server. The agent would not connect back
> >> >> to the management server without manual intervention; however, it would
> >> >> happily continue running the VMs. The management server would initiate
> >> >> HA and fire up VMs which were already running on the disconnected host.
> >> >> I ended up with a handful of VMs and virtual routers being run on two
> >> >> hypervisors, thus corrupting the disks and having all sorts of issues ((( .
> >> >>
> >> >> I think there has to be a better way of dealing with this case, at
> >> >> least at the image level. Perhaps a host should keep some sort of lock
> >> >> file, or a file for every image where it records a timestamp. Something
> >> >> like:
> >> >>
> >> >> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and
> >> >> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp
> >> >>
> >> >> Thus, f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk
> >> >> image and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's
> >> >> timestamp file.
> >> >>
> >> >> The hypervisor should record the timestamp in this file while the VM
> >> >> is running, let's say every 5-10 seconds. If the timestamp is old, we
> >> >> can assume that the volume is no longer used by the hypervisor.
> >> >>
> >> >> When a VM is started, the timestamp file should be checked; if the
> >> >> timestamp is recent, the VM should not start, otherwise the VM should
> >> >> start and the timestamp file should be regularly updated.
> >> >>
> >> >> I am sure there are better ways of doing this, but at least this method
> >> >> should not allow two VMs running on different hosts to use the same
> >> >> volume and corrupt the data.
> >> >>
> >> >> In Ceph, as far as I remember, a new feature is being developed to
> >> >> provide a locking mechanism for an RBD image. Not sure if this will do
> >> >> the job?
> >> >>
> >> >
> >> > Something like this is still on my wishlist for Ceph/RBD, something like
> >> > you propose.
> >> >
> >> > For NFS we currently have this in place, but for Ceph/RBD we don't. It's
> >> > a matter of code in the Agent and the investigators inside the
> >> > Management Server which decide if HA should kick in.
> >> >
> >> > Wido
> >> >
> >> >> Andrei
> >> >>
> >> >> ----- Original Message -----
> >> >>
> >> >>> From: "Wido den Hollander" <w...@widodh.nl>
> >> >>> To: dev@cloudstack.apache.org
> >> >>> Sent: Monday, 16 February, 2015 11:32:13 AM
> >> >>> Subject: Re: Disable HA temporary ?
> >> >>
> >> >>> On 16-02-15 11:00, Andrija Panic wrote:
> >> >>>> Hi team,
> >> >>>>
> >> >>>> I had some funny behaviour a few days ago - one of my hosts was under
> >> >>>> heavy load (some disk/network load) and it got disconnected from the
> >> >>>> MGMT server.
> >> >>>>
> >> >>>> The MGMT server then started doing the HA thing, but without being
> >> >>>> able to make sure that the VMs on the disconnected host were really
> >> >>>> shut down (and they were NOT).
> >> >>>>
> >> >>>> So MGMT started some VMs again on other hosts, resulting in 2 copies
> >> >>>> of the same VM using shared storage - so corruption happened on the
> >> >>>> disks.
> >> >>>>
> >> >>>> Is there a way to temporarily disable the HA feature at the global
> >> >>>> level, or anything similar?
> >> >>
> >> >>> Not that I'm aware of, but this is something I have also run into a
> >> >>> couple of times.
> >> >>
> >> >>> It would indeed be nice if there were a way to stop the HA process
> >> >>> completely as an Admin.
> >> >>
> >> >>> Wido
> >> >>
> >> >>>> Thanks
> >> >>>>
> >> >>
> >> >
> >
> >
> > --
> >
> > Andrija Panić

--
Andrija Panić
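PS: just to make Andrei's timestamp idea above a bit more concrete, here is a rough sketch in plain Python of what such a check could look like. Everything in it is hypothetical (file names, intervals, thresholds) - it is not how the CloudStack agent works today, and as Wido notes it would only cover file-based primary storage (e.g. NFS), not RBD, unless Ceph grows its own locking:

    import time

    HEARTBEAT_INTERVAL = 5   # hypervisor refreshes the timestamp this often (seconds)
    STALE_AFTER = 30         # volume is considered free if not refreshed for this long

    def timestamp_path(volume_path):
        # e.g. /mnt/primary/f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp
        return volume_path + "-timestamp"

    def refresh_timestamp(volume_path):
        """Run periodically on the host that currently runs the VM."""
        with open(timestamp_path(volume_path), "w") as f:
            f.write(str(int(time.time())))

    def volume_looks_in_use(volume_path):
        """Checked before starting a VM that uses this volume anywhere else."""
        try:
            with open(timestamp_path(volume_path)) as f:
                last_beat = int(f.read().strip() or 0)
        except (OSError, ValueError):
            return False   # no readable timestamp file -> assume the volume is free
        return (time.time() - last_beat) < STALE_AFTER

    # On the running host, something like this keeps the timestamp fresh:
    #   while vm_is_running:
    #       refresh_timestamp("/mnt/primary/f5ffa8b0-d852-41c8-a386-6efb8241e2e7")
    #       time.sleep(HEARTBEAT_INTERVAL)
    #
    # And before HA (or anyone else) starts the VM elsewhere:
    #   if volume_looks_in_use("/mnt/primary/f5ffa8b0-d852-41c8-a386-6efb8241e2e7"):
    #       refuse to start - the disk is probably still attached on the "dead" host

The tricky parts are of course clock skew between hosts and a storage backend that is slow to sync the file, so STALE_AFTER would have to be fairly generous.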