Hi,

Sorry for the top-posting, but it was hard to fit my complete view inline.
I'm also thinking about a possible solution for automatic server
evacuation. I see two separate sub-problems here:

1) compute node monitoring and fencing
2) automatic server evacuation

Compute node monitoring is currently implemented in the servicegroup
module of nova. As far as I understand, pacemaker is the solution proposed
in this thread to handle both monitoring and fencing, but we tried it and
found that pacemaker_remote on bare metal does not (yet) work together
with fencing; see the pacemaker_remote bare metal use case documentation
linked below. So if we need fencing, then either we go with normal
pacemaker instead of pacemaker_remote (but that solution doesn't scale),
or we configure and call stonith directly when pacemaker detects the
compute node failure.

We can create a pacemaker driver for servicegroup, and that driver can
hide this currently missing pacemaker functionality by calling stonith
directly today, removing the extra functionality as soon as pacemaker
itself is capable of doing it. However, this means that the servicegroup
driver has to know the stonith configuration of the compute nodes.

Another concern of mine with pacemaker is that an "up" state of the
resource representing the compute node does not automatically mean that
the nova-compute service is also up and running on that node. So we would
have to ask deployers to configure the nova-compute service as a pacemaker
resource tied to the compute node. Without this configuration change,
another possibility would be to calculate the up state of a compute
service by evaluating a logical operator over a coupled set of sources
(e.g. service state in the DB AND pacemaker state of the node).

For automatic server evacuation we need a piece of code that periodically
gets information about the state of the compute nodes and calls the nova
evacuate command if necessary.
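To make the "logical operator over a coupled set of sources" idea concrete, here is a minimal sketch. All class and callable names are hypothetical; this is not the real nova servicegroup driver interface, and the DB, pacemaker, and stonith calls are injected stand-ins:

```python
# Hypothetical sketch of the combined health check described above: a
# compute node counts as up only if BOTH the nova service state in the DB
# AND the pacemaker view of the node say it is up. All names are
# illustrative, not the actual nova servicegroup driver API.

class CombinedHealthDriver:
    def __init__(self, db_is_up, pacemaker_is_up, fence):
        # Each argument is a callable taking a hostname; they are injected
        # so the sketch stays independent of the real DB / pacemaker /
        # stonith interfaces.
        self._db_is_up = db_is_up
        self._pacemaker_is_up = pacemaker_is_up
        self._fence = fence

    def is_up(self, host):
        # AND of both sources: either one reporting "down" marks the
        # compute service as down.
        return self._db_is_up(host) and self._pacemaker_is_up(host)

    def handle_failure(self, host):
        # Fence (stonith) the node before anything is evacuated from it,
        # so a half-dead node cannot touch shared storage afterwards.
        if not self.is_up(host):
            self._fence(host)
            return True
        return False
```

The point of the injected callables is that the stonith knowledge stays in one replaceable hook, which is exactly the piece that could be dropped once pacemaker itself handles fencing for remote nodes.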
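The evacuation engine just described could, very roughly, be a polling loop like the one below. Again, every name here is a hypothetical stand-in (a real engine would query the servicegroup API and call nova's evacuate command), and this sketch sidesteps the shared-storage question entirely:

```python
# Hypothetical sketch of the evacuation-engine loop: periodically poll the
# health source and trigger one evacuation per server on each failed host.
# All callables are illustrative stand-ins, not real nova APIs.

import time

def evacuation_loop(list_hosts, is_up, servers_on, evacuate,
                    poll_interval=10.0, max_cycles=None):
    # list_hosts()      -> iterable of compute host names
    # is_up(host)       -> bool, e.g. the DB-AND-pacemaker check
    # servers_on(host)  -> iterable of server ids on that host
    # evacuate(server)  -> trigger evacuation (fencing must already have
    #                      happened by the time this is called)
    cycles = 0
    evacuated = []
    while max_cycles is None or cycles < max_cycles:
        for host in list_hosts():
            if not is_up(host):
                for server in servers_on(host):
                    evacuate(server)
                    evacuated.append(server)
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(poll_interval)
    return evacuated
```

`max_cycles` exists only to make the sketch testable; a real engine would run forever and would also need to remember which servers it has already evacuated, so a host that stays down is not evacuated twice.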
Today the information source for the compute node state is the
servicegroup API, so the evacuation engine either has to be part of nova,
or the servicegroup API needs to be made available from outside of nova.
To me, adding the evacuation engine to nova looks simpler than
externalizing the servicegroup API.

Today the nova evacuate command expects to be told whether the server is
on shared storage or not, so to be able to call evacuate automatically we
also need to determine that automatically. We could also consider
persisting some of the scheduler hints, for example the group hint used by
the ServerGroupAntiAffinityFilter, as proposed in
https://blueprints.launchpad.net/nova/+spec/validate-targethost-live-migration

The new pacemaker servicegroup driver can be implemented first; then we
can add the evacuation engine as a next step. I'm happy to help with the
BP work and the implementation of the feature.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Remote/#_baremetal_remote_node_use_case

Cheers,
Gibi

> -----Original Message-----
> From: Jastrzebski, Michal [mailto:michal.jastrzeb...@intel.com]
> Sent: October 18, 2014 09:09
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [Nova] Automatic evacuate
>
> > -----Original Message-----
> > From: Florian Haas [mailto:flor...@hastexo.com]
> > Sent: Friday, October 17, 2014 1:49 PM
> > To: OpenStack Development Mailing List (not for usage questions)
> > Subject: Re: [openstack-dev] [Nova] Automatic evacuate
> >
> > On Fri, Oct 17, 2014 at 9:53 AM, Jastrzebski, Michal
> > <michal.jastrzeb...@intel.com> wrote:
> > >
> > >> -----Original Message-----
> > >> From: Florian Haas [mailto:flor...@hastexo.com]
> > >> Sent: Thursday, October 16, 2014 10:53 AM
> > >> To: OpenStack Development Mailing List (not for usage questions)
> > >> Subject: Re: [openstack-dev] [Nova] Automatic evacuate
> > >>
> > >> On Thu, Oct 16,
> > >> 2014 at 9:25 AM, Jastrzebski, Michal
> > >> <michal.jastrzeb...@intel.com> wrote:
> > >> > In my opinion flavor defining is a bit hacky. Sure, it will
> > >> > provide us functionality fairly quickly, but it will also strip
> > >> > away the flexibility Heat would give. Healing can be done in
> > >> > several ways: simple destroy -> create (basic convergence
> > >> > workflow so far), evacuate with or without shared storage, even
> > >> > rebuild vm, and probably a few more once we put more thought
> > >> > into it.
> > >>
> > >> But then you'd also need to monitor the availability of
> > >> *individual* guests, and down you go the rabbit hole.
> > >>
> > >> So suppose you're monitoring a guest with a simple ping. And it
> > >> stops responding to that ping.
> > >
> > > I was more referring to monitoring the host (not the guest), and
> > > for sure not by ping. I was thinking of the current zookeeper
> > > servicegroup implementation; we might want to use corosync and
> > > write a servicegroup plugin for that. There are several choices,
> > > each of which really requires testing before we make any decision.
> > >
> > > There is also the fencing case, which we agree is important, and I
> > > think nova should be able to do that (since it does evacuate, it
> > > should also do fencing). But for working fencing we really need
> > > working host health monitoring, so I suggest we take baby steps
> > > here and solve one issue at a time. And that would be host
> > > monitoring.
> >
> > You're describing all of the cases for which Pacemaker is the
> > perfect fit. Sorry, I see absolutely no point in teaching Nova to do
> > that.
>
> Here you go: https://bugs.launchpad.net/nova/+bug/1379292
> I could think of a few others. Also, since the servicegroup API is
> plugin based, we can actually use Pacemaker and connect it to nova.
> Afaik Pacemaker had big scaling issues; has anyone tried
> pacemaker_remote at scale?
>
> > >> (1) Has it died?
> > >> (2) Is it just too busy to respond to the ping?
> > >> (3) Has its guest network stack died?
> > >> (4) Has its host vif died?
> > >> (5) Has the L2 agent on the compute host died?
> > >> (6) Has its host network stack died?
> > >> (7) Has the compute host died?
> > >>
> > >> Suppose further it's using shared storage (running off an RBD
> > >> volume or using an iSCSI volume, or whatever). Now you have
> > >> almost as many recovery options as possible causes for the
> > >> failure, and some of those recovery options will potentially
> > >> destroy your guest's data.
> > >>
> > >> No matter how you twist and turn the problem, you need strongly
> > >> consistent distributed VM state plus fencing. In other words, you
> > >> need a full blown HA stack.
> > >>
> > >> > I'd rather use nova for low level tasks and maybe low level
> > >> > monitoring (imho nova should do that using servicegroup). But
> > >> > I'd use something more configurable for actual task triggering,
> > >> > like heat. That would give us a framework rather than a
> > >> > mechanism. Later we might want to apply HA on network or
> > >> > volume; then we'll have the mechanism ready, and only the
> > >> > monitoring hook and healing will need to be implemented.
> > >> >
> > >> > We can use scheduler hints to place resources on HA-compatible
> > >> > hosts (whichever health action we'd like to use); this will be
> > >> > a bit more complicated, but will also give us more flexibility.
> > >>
> > >> I apologize in advance for my bluntness, but this all sounds to
> > >> me like you're vastly underrating the problem of reliable guest
> > >> state detection and recovery. :)
> > >
> > > Guest health in my opinion is just a bit out of scope here. If we
> > > have a robust way of detecting host health, we can pretty much
> > > assume that if the host dies, the guests follow. There are ways to
> > > detect guest health (libvirt watchdog, ceilometer, the ping you
> > > mentioned), but that should be done somewhere else. And for sure
> > > not by evacuation.
> >
> > You're making an important point here; you're asking for a "robust
> > way of detecting host health". I can guarantee you that the way of
> > detecting host health that you suggest (i.e. from within Nova) will
> > not be "robust" by HA standards for at least two years, even if your
> > patch lands tomorrow.
>
> We won't have it in two years if we don't start right away. Also, I
> can't see why it has to take that long. I'm not going to reinvent the
> wheel; I was rather talking about teaching nova to use existing
> software, for example pacemaker.
>
> > Cheers,
> > Florian

_______________________________________________
OpenStack-dev mailing list
OpenStackemail@example.com
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev