On Fri, Oct 17, 2014 at 9:53 AM, Jastrzebski, Michal
<michal.jastrzeb...@intel.com> wrote:
>> -----Original Message-----
>> From: Florian Haas [mailto:flor...@hastexo.com]
>> Sent: Thursday, October 16, 2014 10:53 AM
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: Re: [openstack-dev] [Nova] Automatic evacuate
>> On Thu, Oct 16, 2014 at 9:25 AM, Jastrzebski, Michal
>> <michal.jastrzeb...@intel.com> wrote:
>> > In my opinion, defining this in the flavor is a bit hacky. Sure, it will
>> > give us the functionality fairly quickly, but it will also strip us of
>> > the flexibility Heat would give. Healing can be done in several ways:
>> > simple destroy -> create (the basic convergence workflow so far),
>> > evacuate with or without shared storage, even rebuilding the VM, and
>> > probably a few more once we put more thought into it.
>> But then you'd also need to monitor the availability of *individual* guests,
>> and down you go the rabbit hole.
>> So suppose you're monitoring a guest with a simple ping. And it stops
>> responding to that ping.
> I was referring more to monitoring the host (not the guest), and certainly
> not by ping. I was thinking of the current ZooKeeper servicegroup
> implementation; we might want to use Corosync and write a servicegroup
> plugin for that. There are several choices, and each really requires
> testing before we make any decision.
> There is also the fencing case, which we agree is important, and I think
> Nova should be able to do that (since it does evacuate, it should also do
> fencing). But for working fencing we really need working host health
> monitoring, so I suggest we take baby steps here and solve one issue at a
> time. And that would be host monitoring.

You're describing all of the cases for which Pacemaker is the perfect
fit. Sorry, I see absolutely no point in teaching Nova to do that.

>> (1) Has it died?
>> (2) Is it just too busy to respond to the ping?
>> (3) Has its guest network stack died?
>> (4) Has its host vif died?
>> (5) Has the L2 agent on the compute host died?
>> (6) Has its host network stack died?
>> (7) Has the compute host died?
>> Suppose further it's using shared storage (running off an RBD volume or
>> using an iSCSI volume, or whatever). Now you have almost as many recovery
>> options as possible causes for the failure, and some of those recovery
>> options will potentially destroy your guest's data.
>> No matter how you twist and turn the problem, you need strongly consistent
>> distributed VM state plus fencing. In other words, you need a full blown HA
>> stack.
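
To make the quoted list concrete, here is a minimal Python sketch of why a failed ping alone cannot justify restarting a guest that runs off shared storage. Everything in it is an illustrative assumption: the `Cause` names simply mirror items (1) through (7) above, and `GUEST_MAY_STILL_RUN` and `safe_to_restart_elsewhere` are hypothetical helpers, not Nova or Pacemaker APIs.

```python
from enum import Enum


class Cause(Enum):
    """Possible reasons a guest stops answering a ping (items 1-7 above)."""
    GUEST_DIED = 1
    GUEST_TOO_BUSY = 2
    GUEST_NETWORK_STACK_DIED = 3
    HOST_VIF_DIED = 4
    L2_AGENT_DIED = 5
    HOST_NETWORK_STACK_DIED = 6
    COMPUTE_HOST_DIED = 7


# Assumption for illustration: with these causes the guest may still be
# running, and therefore still writing to its shared volume.
GUEST_MAY_STILL_RUN = {
    Cause.GUEST_TOO_BUSY,
    Cause.GUEST_NETWORK_STACK_DIED,
    Cause.HOST_VIF_DIED,
    Cause.L2_AGENT_DIED,
    Cause.HOST_NETWORK_STACK_DIED,
}


def safe_to_restart_elsewhere(possible_causes, host_fenced):
    """Return True only if no second writer to the shared volume can exist.

    Fencing rules out a surviving guest regardless of the cause; without
    fencing, every remaining candidate cause must imply the guest is gone.
    """
    if host_fenced:
        return True
    return not (set(possible_causes) & GUEST_MAY_STILL_RUN)


# A failed ping by itself rules out none of the seven causes.
after_failed_ping = list(Cause)
print(safe_to_restart_elsewhere(after_failed_ping, host_fenced=False))
print(safe_to_restart_elsewhere(after_failed_ping, host_fenced=True))
```

In this sketch the unfenced case is rejected because several candidate causes leave the guest alive, which is exactly the situation where a second copy of the guest could corrupt data on the shared volume.
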
>> > I'd rather use nova for low level tasks and maybe low level monitoring
>> > (imho nova should do that using servicegroup). But I'd use something
>> > more configurable, like heat, for actual task triggering. That
>> > would give us a framework rather than a mechanism. Later we might want to
>> > apply HA to network or volume; then we'll have the mechanism ready, and
>> > just the monitoring hook and healing will need to be implemented.
>> >
>> > We can use scheduler hints to place a resource on an HA-compatible host
>> > (whichever health action we'd like to use). This will be a bit more
>> > complicated, but will also give us more flexibility.
>> I apologize in advance for my bluntness, but this all sounds to me like
>> you're vastly underrating the problem of reliable guest state detection
>> and recovery. :)
> Guest health, in my opinion, is just a bit out of scope here. If we have a
> robust way of detecting host health, we can pretty much assume that if the
> host dies, its guests follow.
> There are ways to detect guest health (libvirt watchdog, ceilometer, the
> ping you mentioned), but that should be done somewhere else. And certainly
> not by evacuation.

You're making an important point here; you're asking for a "robust way
of detecting host health". I can guarantee you that the way of
detecting host health that you suggest (i.e. from within Nova) will
not be "robust" by HA standards for at least two years, even if your
patch lands tomorrow.

