Excerpts from Jonathan Proulx's message of 2016-10-17 14:49:13 -0400:
> Hi All,
> Just on the other side of a Kilo->Mitaka upgrade (with a very brief
> transit through Liberty in the middle).
> As usual I've caught a few problems in production that I have no idea
> how I could possibly have tested for because they relate to older
> running instances and some remnants of older package versions on the
> production side which wouldn't have existed in test unless I'd
> installed the test server with Havana and done incremental upgrades
> starting a fairly wide suite of test instances along the way.
In general, modifying _anything_ in place is hard to test.
You're much better off with as much immutable content as possible on all
of your nodes. If you've been wondering what this whole Docker nonsense
is about, well, that's what it's about. You run docker build once per
software release attempt, then mount data read/write and configs
read-only.
Both openstack-ansible and kolla are deployment projects that try to do
some of this via lxc or docker, IIRC.
This way when you test your container image in test, you copy it out to
prod, start up the new containers, stop the old ones, and you know that
_at least_ you don't have older stuff running anymore. Data and config
are still likely to be the source of issues, but there are other ways
to help test that.
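As a rough sketch of that flow (the image tag, service name, and paths
here are all made up for illustration):

```shell
# Build one immutable image per release attempt.
docker build -t mycloud/nova-api:mitaka-rc1 .

# Run it with mutable data mounted read/write and configs read-only,
# so the only things that differ between test and prod are data and
# config -- never the installed software.
docker run -d --name nova-api \
  -v /var/lib/nova:/var/lib/nova \
  -v /etc/nova:/etc/nova:ro \
  mycloud/nova-api:mitaka-rc1
```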
> First thing that bit me was neutron-db-manage being confused because
> my production system still had migrations from Havana hanging around.
> I'm calling this a packaging bug
> https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1633576 but I
> also feel like remembering release names forever might be a good
Ouch. Indeed, one of the first things to do _before_ an upgrade is to run
the migrations of the current version to make sure your schema is up to
date. Also it's best to make sure you have _all_ of the stable updates
before you do that, since it's possible fixes have landed in the
migrations that are meant to smooth the upgrade process.
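Concretely, something like the following on the release you're upgrading
*from* (exact config-file paths vary by deployment, and the plugin config
argument is illustrative):

```shell
# Check where neutron's schema currently is, then walk it to the head
# of the release you are still running, before touching the new packages.
neutron-db-manage --config-file /etc/neutron/neutron.conf \
  --config-file /etc/neutron/plugins/ml2/ml2_conf.ini current
neutron-db-manage --config-file /etc/neutron/neutron.conf \
  --config-file /etc/neutron/plugins/ml2/ml2_conf.ini upgrade heads

# Same idea for nova: make sure the old release's schema is fully synced.
nova-manage db sync
```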
> Later I discovered during the Juno release (maybe earlier ones too)
> making snapshot of running instances populated the snapshot's meta
> data with "instance_type_vcpu_weight: none". Currently (Mitaka) this
> value must be an integer if it is set or boot fails. This has the
> interesting side effect of putting your instance into shutdown/error
> state if you try a hard reboot of a formerly working instance. I
> 'fixed' this manually frobbing the DB to set lines where
> instance_type_vcpu_weight was set to none to be deleted.
This one is tough because it is clearly data and state related. It's
hard to say how you got the 'none' values in there instead of ints.
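For anyone facing the same cleanup, the "frobbing" described above amounts
to soft-deleting the bad metadata rows. Here's a toy sketch of that logic
using sqlite in place of MySQL; the table and column names are recalled
from nova's instance_system_metadata schema and the 'None' literal is a
guess at how the value appears, so verify both against your own database
before running anything like this for real:

```python
import sqlite3

# Toy stand-in for nova's instance_system_metadata table (schema
# approximated from memory -- check your deployment's actual schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE instance_system_metadata (
    id INTEGER PRIMARY KEY,
    instance_uuid TEXT,
    "key" TEXT,
    value TEXT,
    deleted INTEGER DEFAULT 0)""")
conn.executemany(
    'INSERT INTO instance_system_metadata (instance_uuid, "key", value) '
    "VALUES (?, ?, ?)",
    [("uuid-1", "instance_type_vcpu_weight", "None"),   # the bad row
     ("uuid-2", "instance_type_vcpu_weight", "10"),     # a valid int
     ("uuid-3", "instance_type_memory_mb", "2048")])

# Soft-delete only rows whose value is the literal string 'None',
# following nova's convention of setting deleted = id rather than
# removing the row outright.
cur = conn.execute(
    "UPDATE instance_system_metadata SET deleted = id "
    "WHERE \"key\" = 'instance_type_vcpu_weight' AND value = 'None'")
print(cur.rowcount)  # -> 1
```

The point of the sketch is just that the WHERE clause has to be narrow:
only the bogus values get touched, and valid integer weights survive.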
Somebody else suggested making db snapshots and loading them into a test
control plane. That seems like an easy-ish way to do some surface-level
checking, but it could also be super dangerous if not isolated well, and
the more isolation you add, the less realistic the simulation becomes.
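Something along these lines, with the big caveat that the test control
plane must be firewalled off from the real hypervisors and message queues
(hostnames here are invented):

```shell
# Snapshot the production nova database; --single-transaction keeps the
# dump consistent for InnoDB tables without locking everything.
mysqldump --single-transaction -h prod-db nova > nova-snapshot.sql

# Load it into an isolated test database and point a throwaway control
# plane at that, never at production services.
mysql -h test-db -e 'CREATE DATABASE IF NOT EXISTS nova'
mysql -h test-db nova < nova-snapshot.sql
```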
> Does anyone have strategies on how to actually test for problems with
> "old" artifacts like these?
> Yes having things running from 18-24month old snapshots is "bad" and
> yes it would be cleaner to install a fresh control plane at each
> upgrade and cut over rather than doing an actual in place upgrade. But
> neither of these sub-optimal patterns are going all the way away
> anytime soon.
In-place upgrades must work. If they don't, please file bugs and
complain loudly. :)
OpenStack-operators mailing list