On 25/02/15 20:18, Joe Gordon wrote:
> 
> 
> On Fri, Feb 20, 2015 at 3:48 AM, Matthew Booth <mbo...@redhat.com
> <mailto:mbo...@redhat.com>> wrote:
> 
>     Gary Kotton came across a doozy of a bug recently:
> 
>     https://bugs.launchpad.net/nova/+bug/1419785
> 
>     In short, when you start a Nova compute, it will query the driver for
>     instances and compare that against the expected host of the instance
>     according to the DB. If the driver is reporting an instance the DB
>     thinks is on a different host, it assumes the instance was evacuated
>     while Nova compute was down, and deletes it on the hypervisor. However,
>     Gary found that you trigger this when starting up a backup HA node which
>     has a different `host` config setting. i.e. You fail over, and the first
>     thing it does is delete all your instances.
> 
>     Gary and I both agree on a couple of things:
> 
>     1. Deleting all your instances is bad
>     2. HA nova compute is highly desirable for some drivers
> 
> 
> There is a deeper issue here, that we are trying to work around.  Nova
> was never designed to have entire systems running behind a nova-compute.
> It was designed to have one nova-compute per 'physical box that runs
> instances'
> 
> There have been many discussions in the past on how to fix this issue
> (by adding a new point in nova where clustered systems can plug in), but
> if I remember correctly the gotcha was no one was willing to step up to
> do it.

There are 2 unrelated concepts of clusters here. The VMware driver has
both, which seems to result in some confusion. As it happens, this issue
doesn't relate to either of them.

Firstly, there's a VMware cluster. This presents itself as, and is
managed as, a single hypervisor. The only issue Nova has with VMware
clusters is in the resource tracker, because a cluster's resources
aren't contiguous, i.e. it's an accounting issue. It would be good to
have a solution to this, but it doesn't seem to be causing many real
world problems.
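
As a rough illustration of that accounting issue, here is a minimal
sketch with made-up numbers. It is not Nova code: the function names and
the figures are hypothetical, and it assumes an instance has to fit on
one member host of the cluster.

```python
# Hypothetical illustration only; none of these names exist in Nova.
# Free memory (MB) on each member host of a single VMware cluster.
member_hosts_free_mb = [8192, 8192]


def aggregate_free_mb(hosts):
    """Free memory as seen when the cluster is reported as one hypervisor."""
    return sum(hosts)


def fits_on_some_host(hosts, requested_mb):
    """Whether any single member host can actually hold the instance."""
    return any(free >= requested_mb for free in hosts)


requested_mb = 12288

# The aggregate view says the request fits...
looks_ok = aggregate_free_mb(member_hosts_free_mb) >= requested_mb
# ...but no individual member host has that much free memory.
really_ok = fits_on_some_host(member_hosts_free_mb, requested_mb)

print(looks_ok, really_ok)
```

The aggregate figure overstates what can be placed, which is why this is
an accounting problem rather than a correctness problem in the driver.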

Secondly, there's the concept of 'nodes', whereby 1 nova compute can
manage multiple hypervisors. On VMware this means managing multiple
clusters, because 1 VMware cluster == 1 hypervisor. Both the Ironic and
VMware drivers can do this.

This issue relates to the co-location of nova compute with the
hypervisor. In the case of both Ironic and VMware, it is not possible to
co-locate nova compute with the hypervisor. That means that nova compute
must exist separately and be pointed at the hypervisor, which raises the
possibility that 2 different nova computes might accidentally be pointed
at the same hypervisor. As Gary discovered, this makes bad things
happen. Note that no clusters of either kind described above are
required to trigger this bug.
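
The failure mode can be modelled in a few lines. This is a simplified
sketch of the startup check as described in the bug report, not Nova's
actual implementation; the function name, the host names and the data
structures are all hypothetical.

```python
# Simplified model of the startup check; not Nova's real code.
# All names below are hypothetical.

def instances_to_destroy(my_host, driver_instances, db_host_by_instance):
    """Return instances the driver reports that the DB places on another host.

    The startup logic assumes such instances were evacuated while this
    compute was down, and so removes them from the hypervisor.
    """
    doomed = []
    for uuid in driver_instances:
        db_host = db_host_by_instance.get(uuid)
        if db_host is not None and db_host != my_host:
            doomed.append(uuid)
    return doomed


# One hypervisor with three instances, recorded in the DB under "compute-a".
hypervisor_instances = ["vm-1", "vm-2", "vm-3"]
db = {uuid: "compute-a" for uuid in hypervisor_instances}

# The original compute starts up: nothing looks evacuated.
print(instances_to_destroy("compute-a", hypervisor_instances, db))

# A second compute with a different `host` setting is pointed at the same
# hypervisor: every instance now looks evacuated and would be destroyed.
print(instances_to_destroy("compute-b", hypervisor_instances, db))
```

Nothing in the model involves a cluster or multiple nodes: a single
hypervisor and two differently named computes are enough.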

I have a new patch for this here, btw:
https://review.openstack.org/#/c/158269/ . I'd be grateful for more eyes
on it.

Thanks,

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
