On Thu, Oct 16, 2014 at 1:59 PM, Russell Bryant <[email protected]> wrote:
> On 10/16/2014 04:29 AM, Florian Haas wrote:
>>>>>> (5) Let monitoring and orchestration services deal with these use
>>>>>> cases and have Nova simply provide the primitive API calls that it
>>>>>> already does (i.e. host evacuate).
>>>>>
>>>>> That would arguably lead to an incredible amount of wheel reinvention
>>>>> for node failure detection, service failure detection, etc. etc.
>>>>
>>>> How so? (5) would use existing wheels for monitoring and orchestration
>>>> instead of writing all new code paths inside Nova to do the same thing.
>>>
>>> Right, there may be some confusion here ... I thought you were both
>>> agreeing that the use of an external toolset was a good approach for the
>>> problem, but Florian's last message makes that not so clear ...
>>
>> While one of us (Jay or me) speaking for the other and saying we agree
>> is a distributed consensus problem that dwarfs the complexity of
>> Paxos, *I* for my part do think that an "external" toolset (i.e. one
>> that lives outside the Nova codebase) is the better approach versus
>> duplicating the functionality of said toolset in Nova.
>>
>> I just believe that the toolset that should be used here is
>> Corosync/Pacemaker and not Ceilometer/Heat. And I believe the former
>> approach leads to *much* fewer necessary code changes *in* Nova than
>> the latter.
>
> Have you tried pacemaker_remote yet? It seems like a better choice for
> this particular case, as opposed to using corosync, due to the potential
> number of compute nodes.

I'll assume that you are *not* referring to running Corosync/Pacemaker
on the compute nodes plus pacemaker_remote in the VMs, because doing
so would blow up the separation between the cloud operator and tenant
space.

Running compute nodes as baremetal extensions of a different
Corosync/Pacemaker cluster (presumably the one that manages the other
Nova services) would potentially be an option, although vendors would
need to buy into this. Ubuntu, for example, currently only ships
pacemaker-remote in universe.

*If* you're running pacemaker_remote on the compute node, though, that
then also opens up the possibility for a compute driver to just dump
the libvirt definition into a VirtualDomain Pacemaker resource,
meaning with a small callout added to Nova, you could also get the
virtual machine monitoring functionality.

Bonus: this could eventually be extended to allow live migration of
guests to other compute nodes in the same cluster, in case you want to
shut down a compute node for maintenance without interrupting your HA
guests.

Cheers,
Florian

_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
