We've encountered a bug in resize which resulted in data loss. The gist: a user was resizing a qcow2 instance whose image had been deleted from Glance. In driver.finish_migration on the destination host, an error occurred while attempting to copy the image from the source host's image cache, putting the instance into an error state. Note that instance.host is set to the destination host before finish_migration runs.

When the image cache cleanup ran on the source host, the instance was no longer in the list of expected instances on that host, because instance.host == dest. The image cache manager therefore expired the image from the cache, and there was no other copy of the image.
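To make the failure mode concrete, here is a minimal sketch of the eviction decision described above. This is not Nova's actual code: instances are modelled as plain dicts, and the field names are only illustrative.

```python
# Hedged sketch of the image cache eviction decision described above.
# In Nova these are DB-backed objects; the dict fields here are
# illustrative only.

def images_in_use(instances, my_nodes):
    """An image is considered in use only if some instance currently
    hosted on one of this manager's nodes references it."""
    return {i['image_ref'] for i in instances if i['host'] in my_nodes}

# During resize, instance.host is flipped to the destination before
# finish_migration completes:
instance = {'uuid': 'abc123', 'host': 'dest', 'image_ref': 'img-1'}

# The source host's cleanup pass now sees no user of img-1, so the
# cached image is eligible for expiry even though the resize may
# still fail and need the source copy:
print('img-1' in images_in_use([instance], my_nodes={'source'}))
```

Running this prints False: from the source host's point of view, the image has no users the moment instance.host changes, which is exactly the window in which the data loss occurred.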
Let's ignore the root cause of the side-loading error, because that's the type of transient error which can always occur. I'm looking for a way to avoid deleting the image from the image cache until the resize operation has completed. The obvious way to do this is to update the instance list generated in ComputeManager._run_image_cache_manager_pass to consider not only instances where instance.host is in the node list, but also any instance with a migration record whose source or destination is in the node list.

The problem with this is that the data model doesn't seem to allow us to fetch the currently active migration. Following the error above, the errors_out_migration decorator on finish_resize has set the migration to an error state. AFAICT this record is never deleted, so the presence of a migration in an error state only means that a migration involving this instance has occurred at some point in the past. It doesn't mean that it's currently relevant, so it's basically meaningless for this purpose.

Firstly, have I missed any semantics of the migration record which might allow me to unambiguously identify currently relevant migrations, whether in an error state or otherwise? That would be ideal, and I'd just go with that. If not, how about adding an active migration field to the instance? I don't think it would ever make sense to have more than one current migration for a given instance. It would be set back to NULL when the migration completed, and we'd at least have an opportunity to do something explicit with migrations in an error state.

In the meantime I'm going to look for more backportable avenues to fix this, perhaps not updating instance.host until after finish_migration.

Matt
--
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
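The _run_image_cache_manager_pass change proposed above could be sketched roughly as follows. This is not Nova code: the dict modelling, the status names, and the helper are all illustrative, and an "active" migration is approximated by a status whitelist precisely because, as noted above, the mere presence of an error-state migration record is ambiguous.

```python
# Hedged sketch of the proposed filter change. Status values and field
# names are assumptions for illustration, not Nova's actual schema.

# Hypothetical set of statuses that mean "this migration is still in
# flight and its images must not be expired on either host":
ACTIVE_MIGRATION_STATUSES = {'queued', 'preparing', 'running',
                             'post-migrating'}

def instances_to_keep(instances, migrations, nodes):
    """Return UUIDs of instances whose cached images must be preserved
    on any of `nodes`.

    Includes instances currently hosted on one of the nodes, plus any
    instance with an in-progress migration whose source or destination
    is one of the nodes.
    """
    keep = {i['uuid'] for i in instances if i['host'] in nodes}
    for m in migrations:
        if (m['status'] in ACTIVE_MIGRATION_STATUSES and
                (m['source_compute'] in nodes or
                 m['dest_compute'] in nodes)):
            keep.add(m['instance_uuid'])
    return keep
```

With this shape, an instance mid-resize is protected on the source host even after instance.host has been flipped to the destination, because its migration record still names the source as source_compute.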
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev