On 20/02/15 11:48, Matthew Booth wrote:
> Gary Kotton came across a doozy of a bug recently:
>
> https://bugs.launchpad.net/nova/+bug/1419785
>
> In short, when you start a Nova compute, it will query the driver for
> instances and compare that against the expected host of the instance
> according to the DB. If the driver is reporting an instance the DB
> thinks is on a different host, it assumes the instance was evacuated
> while Nova compute was down, and deletes it on the hypervisor. However,
> Gary found that you can trigger this when starting up a backup HA node
> which has a different `host` config setting. i.e. You fail over, and
> the first thing it does is delete all your instances.
>
> Gary and I both agree on a couple of things:
>
> 1. Deleting all your instances is bad
> 2. HA nova compute is highly desirable for some drivers
>
> We disagree on the approach to fixing it, though. Gary posted this:
>
> https://review.openstack.org/#/c/154029/
>
> I've already outlined my objections to this approach elsewhere, but to
> summarise: I think this fixes one symptom of a design problem, and
> leaves the rest untouched. If the value of nova compute's `host`
> changes, then the assumption that instances associated with that
> compute can be identified by the value of instance.host becomes
> invalid. This assumption is pervasive, so it breaks a lot of stuff.
> The worst one is _destroy_evacuated_instances(), which Gary found, but
> if you scan nova/compute/manager for the string 'self.host' you'll
> find lots of them. For example, all the periodic tasks are broken,
> including image cache management, and the state of ResourceTracker
> will be unusual. Worse, whenever a new instance is created it will
> have a different value of instance.host, so instances running on a
> single hypervisor will become partitioned based on which nova compute
> was used to create them.
>
> In short, the system may appear to function superficially, but it's
> unsupportable.
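For clarity, the failure mode described above can be sketched roughly as
follows. This is a simplified illustration, not Nova's actual code: the
function name, arguments, and data shapes are all hypothetical, and the real
_destroy_evacuated_instances() works against the driver and DB objects
directly.

```python
# Simplified, hypothetical sketch of the startup cleanup logic: any
# instance the driver reports whose DB record points at a *different*
# host is assumed to have been evacuated while we were down, and is
# selected for deletion on the hypervisor.

def destroy_evacuated_instances(driver_instances, db_hosts, our_host):
    """Return the instance names that would be deleted.

    driver_instances: instance names reported by the hypervisor driver.
    db_hosts: mapping of instance name -> `host` recorded in the DB.
    our_host: this nova-compute's `host` config value.
    """
    to_destroy = []
    for name in driver_instances:
        db_host = db_hosts.get(name)
        # The logic concludes "evacuated" purely from a host mismatch;
        # it cannot tell evacuation apart from a changed `host` setting.
        if db_host is not None and db_host != our_host:
            to_destroy.append(name)
    return to_destroy

# The failover scenario from the bug: the backup node starts with
# host='node-b', but every instance in the DB still has host='node-a',
# so every running instance is selected for deletion.
doomed = destroy_evacuated_instances(
    ['vm1', 'vm2'], {'vm1': 'node-a', 'vm2': 'node-a'}, 'node-b')
```

With matching `host` values nothing is selected; the damage comes entirely
from the backup node reporting a different `host` for the same hypervisor.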
> I had an alternative idea. The current assumption is that the `host`
> managing a single hypervisor never changes. If we break that
> assumption, we break Nova, so we could assert it at startup and refuse
> to start if it's violated. I posted this VMware-specific POC:
>
> https://review.openstack.org/#/c/154907/
>
> However, I think I've had a better idea. Nova creates ComputeNode
> objects for its current configuration at startup which, amongst other
> things, are a map of host:hypervisor_hostname. We could assert when
> creating a ComputeNode that hypervisor_hostname is not already
> associated with a different host, and refuse to start if it is. We
> would give an appropriate error message explaining that this is a
> misconfiguration. This would prevent the user from hitting any of the
> associated problems, including the deletion of all their instances.
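The proposed check amounts to something like the following. Again a hedged
sketch under assumptions, not the actual patch: the function, exception, and
the in-memory registry standing in for the ComputeNode table are all
illustrative.

```python
# Hypothetical sketch of the proposed startup assertion: refuse to
# create a ComputeNode whose hypervisor_hostname is already associated
# with a different `host`, and fail with an explicit misconfiguration
# error instead of silently partitioning (or deleting) instances.

class MisconfigurationError(Exception):
    pass


def claim_hypervisor(registry, host, hypervisor_hostname):
    """Record that `host` manages `hypervisor_hostname`.

    registry: mapping of hypervisor_hostname -> host, standing in for
    the existing ComputeNode records.
    """
    existing = registry.get(hypervisor_hostname)
    if existing is not None and existing != host:
        raise MisconfigurationError(
            "Hypervisor %s is already managed by host %s; refusing to "
            "start as host %s. The `host` value managing a hypervisor "
            "must never change." % (hypervisor_hostname, existing, host))
    registry[hypervisor_hostname] = host
```

Starting the backup node with a different `host` now fails loudly at
ComputeNode creation time, before any of the host-mismatch logic (including
the evacuation cleanup) can run.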
I have posted a patch implementing the above for review here:

https://review.openstack.org/#/c/158269/

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev