Some sort of fencing independent of the management server is
definitely needed.  HA in general (particularly on KVM) is all kinds
of unpredictable/buggy right now.

I like the idea of having a switch that an admin can flip to stop HA.
In fact I think a better job control system in general (e.g., being
able to stop/restart/manually start tasks) would be awesome, if it's
feasible.

Thank You,

Logan Barfield
Tranquil Hosting


On Mon, Feb 16, 2015 at 10:05 AM, Wido den Hollander <w...@widodh.nl> wrote:
>
>
> On 16-02-15 13:16, Andrei Mikhailovsky wrote:
>> I had similar issues at least two or three times. The host agent would
>> disconnect from the management server. The agent would not connect back to
>> the management server without manual intervention; however, it would happily
>> continue running the vms. The management server would initiate HA and
>> fire up vms that were already running on the disconnected host. I ended up
>> with a handful of vms and virtual routers being run on two hypervisors, thus
>> corrupting the disk and having all sorts of issues ((( .
>>
>> I think there has to be a better way of dealing with this case. At least on 
>> an image level. Perhaps a host should keep some sort of lock file or a file 
>> for every image where it would record a time stamp. Something like:
>>
>> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and
>> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp
>>
>> Thus, the f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image 
>> and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp.
>>
>> The hypervisor should record the time stamp in this file while the vm is 
>> running. Let's say every 5-10 seconds. If the timestamp is old, we can 
>> assume that the volume is no longer used by the hypervisor.
>>
>> When a vm is started, the timestamp file should be checked and if the 
>> timestamp is recent, the vm should not start, otherwise, the vm should start 
>> and the timestamp file should be regularly updated.
>>
>> I am sure there are better ways of doing this, but at least this method 
>> should not allow two vms running on different hosts to use the same volume 
>> and corrupt the data.
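
To make the timestamp idea above concrete, here is a minimal sketch in Python. It is not how the CloudStack agent works; the function names, the 30-second staleness threshold, and the atomic-rename write are all assumptions made for illustration, and it assumes the sidecar file sits on the same shared storage as the image.

```python
import os
import tempfile
import time

# Assumed threshold: a heartbeat older than this counts as stale.
# With writes every 5-10 s, 30 s tolerates a few missed updates.
STALE_AFTER_SECONDS = 30.0


def timestamp_path(image_path):
    """Return the sidecar file name, i.e. '<image>-timestamp'."""
    return image_path + "-timestamp"


def write_heartbeat(image_path, now=None):
    """Record the current time next to the image (call periodically while the VM runs)."""
    now = time.time() if now is None else now
    target = timestamp_path(image_path)
    # Write to a temp file and rename it so readers never see a partial value.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target) or ".")
    with os.fdopen(fd, "w") as handle:
        handle.write(repr(float(now)))
    os.replace(tmp, target)


def image_in_use(image_path, now=None, stale_after=STALE_AFTER_SECONDS):
    """Return True if the heartbeat was refreshed within 'stale_after' seconds."""
    now = time.time() if now is None else now
    try:
        with open(timestamp_path(image_path)) as handle:
            last_seen = float(handle.read().strip())
    except FileNotFoundError:
        return False  # no heartbeat file: nobody has claimed the image
    except ValueError:
        return True  # unreadable heartbeat: be conservative and assume in use
    return (now - last_seen) < stale_after


def may_start_vm(image_path, now=None):
    """Refuse to start while the heartbeat is fresh; otherwise claim the image."""
    if image_in_use(image_path, now=now):
        return False
    write_heartbeat(image_path, now=now)
    return True
```

Note that the check-then-write in `may_start_vm` is not atomic, and the comparison relies on the hosts' clocks being roughly in sync, so this only narrows the window for a double start; it is not a substitute for real fencing.
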
>>
>> In ceph, as far as I remember, a new feature is being developed to provide a 
>> locking mechanism of an rbd image. Not sure if this will do the job?
>>
>
> Something like what you propose is still on my wishlist for Ceph/RBD.
>
> For NFS we currently have this in place, but for Ceph/RBD we don't. It's
> a matter of code in the Agent and the investigators inside the
> Management Server which decide if HA should kick in.
>
> Wido
>
>> Andrei
>>
>> ----- Original Message -----
>>
>>> From: "Wido den Hollander" <w...@widodh.nl>
>>> To: dev@cloudstack.apache.org
>>> Sent: Monday, 16 February, 2015 11:32:13 AM
>>> Subject: Re: Disable HA temporary ?
>>
>>> On 16-02-15 11:00, Andrija Panic wrote:
>>>> Hi team,
>>>>
>>>> I just had some funny behaviour a few days ago - one of my hosts was
>>>> under heavy load (some disk/network load) and it got disconnected
>>>> from the MGMT server.
>>>>
>>>> Then the MGMT server started doing the HA thing, but without being
>>>> able to make sure that the VMs on the disconnected host were really
>>>> shut down (and they were NOT).
>>>>
>>>> So MGMT started some of the VMs again on other hosts, resulting in
>>>> 2 copies of the same VM using shared storage - so corruption
>>>> happened on the disk.
>>>>
>>>> Is there a way to temporarily disable the HA feature at the global
>>>> level, or anything similar ?
>>
>>> Not that I'm aware of, but this is something I also ran into a
>>> couple of times.
>>
>>> It would indeed be nice if there could be a way to stop the HA
>>> process completely as an Admin.
>>
>>> Wido
>>
>>>> Thanks
>>>>
>>
