On 5/14/15 11:40 AM, Mike Bayer wrote:
The "online schema changes" patch has been abandoned. I regret that I was not able to review the full nature of this spec in time to note some concerns I have, namely that Alembic does not plan on ever acheiving 100% "automation" of migration generation; such a thing is not possible and would require a vast amount of development resources in any case to constantly keep up with the ever changing features and behaviors of all target databases. The online migration spec, AFAICT, does not offer any place for manual migrations to be added, and I think this will be a major problem. The decisions made by "autogenerate" I have always stated should always be manually reviewed and corrected, so I'd be very nervous about a system that uses autogenerate on the fly and sends those changes directly to a production database without any review.

So it looks like Nova has decided at the summit to forge ahead with online schema migrations. Travel issues prevented me from being present at the summit, and therefore at the session where this was discussed. But had I been there, a short 40-minute session wouldn't have been a venue in which I could have organized my thoughts well enough to be any more effective in discussing this feature, so it's probably better that I wasn't there. As I've mentioned before, the timing of the blueprint on this feature was just not well synchronized for me: it was proposed during my first month working for Red Hat and OpenStack, when I hardly knew what things were, and as you can see from my comments at https://review.openstack.org/#/c/102545/9, I was primarily alarmed at the notion that this system was going to be built entirely on SQLAlchemy-Migrate internals, a project which it was one of my primary tasks at my new job to replace with Alembic. I hardly understood what the actual proposal was, as I was still learning how to install OpenStack at that point, so I really missed being able to dig deeply into it. The spec went quiet after a few weeks and I mostly forgot about it, until November when it suddenly awoke, burned ahead through Christmas, and was approved on Jan 6. Again, terrible timing for me, as my wife gave birth to our son in late October and I was pretty much 24/7 dealing with a newborn, not to mention getting through the holidays. So I missed the boat on the blueprint entirely.

For now, I have to assume that Nova will go ahead with this. But let me at least take some effort to explain more fully what I think the problem with this approach is. I don't think this problem will necessarily be that big a deal for Nova, at least most of the time; but when it is a problem, it might be pretty bad. My concern is that the system has no way at all to provide for manual migration steps, or any control at all over how schema migrations proceed; and critically, that it makes no provision for the very common case of schema migrations that also need data to be moved and manipulated. The blueprint makes no mention whatsoever of how data migrations will be handled, not even within sections such as "Developer impact" or "Testing". Right now, data migrations are just part of the sqlalchemy-migrate scripts or Alembic scripts. But with this change, we no longer write such scripts, nor do we even have a place to put them if we wanted to; data migrations are no longer integrated within this system and have to be dealt with externally.
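
To make concrete what we'd be giving up, here is roughly what an integrated schema + data migration looks like today as a single Alembic script (a minimal sketch only; the revision identifiers are made up, and the "widget_status" table is from the hypothetical example I'll walk through below):

from alembic import op
import sqlalchemy as sa

# revision identifiers, made up for this sketch
revision = 'a1b2c3d4e5f6'
down_revision = 'f6e5d4c3b2a1'


def upgrade():
    # schema change: add the new column
    op.add_column('widget_status', sa.Column('modified_timestamp', sa.DateTime()))

    # data migration, in the same script, written against a schema state
    # that is guaranteed to exist right here
    op.execute(
        "UPDATE widget_status "
        "SET modified_timestamp=convert_string_to_datetime(modified_date)"
    )

    # schema change: drop the old column
    op.drop_column('widget_status', 'modified_date')

The point being that the DDL and the data manipulation interleave freely, against schema states the script author can rely upon.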

It may be the case that Nova has a schema that is no longer in need of major changes, that we are only talking about adding new columns and new tables to support new features and removing some old cruft, and that moving data around is just not going to be needed. But once you build a system that makes data migrations second-class or even non-citizens, you close the doors on how much you can do with your schema. Big changes down the road are basically no longer possible without the ability to also migrate data as the DDL is emitted.

So OK, of course we can still do data migrations. The spec doesn't need to say anything; it should be obvious that they need to be performed during the "migrate" phase, in between "expand" and "contract", when you have both the new tables/columns available as a destination for data and the old tables/columns still present as the source. As far as what form they take, we no longer have migration scripts or versions within a major release, so we have to assume it will be just a big series of scripts somewhere, tagged to the major release like "Kilo" or "Liberty", just a bunch of database code that runs while we're in "migrate".
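
Just so we have something concrete to picture for the rest of this message, I'm imagining a layout along these lines (entirely my own invention; the blueprint specifies nothing of the sort):

nova/db/data_migrations/
    kilo/
        01_some_data_migration.py
        02_another_data_migration.py
    liberty/
        01_some_data_migration.py

with some runner that executes everything under the current release's directory, in order, during the "migrate" phase.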

I have no doubt that's what Nova will do, and usually it will be fine. But to illustrate, here's the kind of scenario where it goes wrong, in a way that is at best pretty annoying and at worst a serious and error-prone development and testing headache. Let's start with a hypothetical schema that has some design issues: two tables, "widget" and "widget_status", where "widget_status" has some kind of information about a "widget", and also stores a timestamp, unfortunately as a string:

CREATE TABLE widget (
    id INTEGER PRIMARY KEY,
    name VARCHAR(30) NOT NULL
)

CREATE TABLE widget_status (
    widget_id INTEGER PRIMARY KEY REFERENCES widget(id),
    status_flag INTEGER NOT NULL,
    modified_date VARCHAR(30)
)
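
Since the online approach drives everything from the models rather than from migration scripts, the corresponding SQLAlchemy models would be roughly as follows (a sketch using plain declarative classes, not Nova's actual model base classes):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Widget(Base):
    __tablename__ = 'widget'

    id = Column(Integer, primary_key=True)
    name = Column(String(30), nullable=False)


class WidgetStatus(Base):
    __tablename__ = 'widget_status'

    widget_id = Column(Integer, ForeignKey('widget.id'), primary_key=True)
    status_flag = Column(Integer, nullable=False)
    # the timestamp, unfortunately stored as a string
    modified_date = Column(String(30))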


Let's say that two entirely different changes by two different developers want to accomplish two things: 1. convert "modified_date" into a new column "modified_timestamp" which is a DATETIME type, not a string and 2. merge these two tables into one, as the need to JOIN all the time is non-performant and unnecessary. That is, we'll end up with this:


CREATE TABLE widget (
    id INTEGER PRIMARY KEY,
    name VARCHAR(30) NOT NULL,
    status_flag INTEGER NOT NULL,
    modified_timestamp DATETIME
)

Right off, let's keep in mind that when online schema migrations run, the fact that there's a #1 and a #2 change to the schema is lost. Even though migration #1 will add "modified_timestamp" to "widget_status", when we run the sum of both #1 and #2, that interim state of the schema will never exist; no changes are made to widget_status except the final DROP, since these changes aren't visible by just looking at the two endpoints of the schema, which, unless I'm totally misunderstanding, is how online schema changes work.
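
That is, if the comparison is ultimately doing something along the lines of Alembic's own autogenerate diff (compare_metadata() is a real Alembic API; whether the online system uses it in exactly this way is my assumption):

from alembic.autogenerate import compare_metadata
from alembic.migration import MigrationContext
from sqlalchemy import create_engine

# hypothetical database URL and model import, just for illustration;
# "Base" is the declarative base from the model sketch above
from myapp.models import Base
engine = create_engine("postgresql://nova@localhost/nova")

with engine.connect() as conn:
    context = MigrationContext.configure(conn)
    # diff the live (pre-expand) database against the models
    # (post-contract state); only these two endpoints are visible
    diffs = compare_metadata(context, Base.metadata)

# "diffs" would contain a ('remove_table', ...) entry for widget_status and
# ('add_column', ...) entries for widget; nothing in it refers to the
# intermediate state where widget_status briefly gains modified_timestamp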

Developer #1 changes the model such that there's a new column on "widget_status" called "modified_timestamp", which stores the value as a datetime; this is a new column that replaces the modified_date column, and because we need to do a data migration, both columns need to exist simultaneously while the string-based dates are UPDATEd into the timestamp column. The developer writes a data migration script that will transfer this data while the table has both columns. If we look at the SQL mapped to online schema migration steps, they are:

"expand":  ALTER TABLE widget_status ADD COLUMN modified_timestamp DATETIME;
"migrate": UPDATE widget_status SET modified_timestamp=convert_string_to_datetime(modified_date)
"contract": ALTER TABLE widget_status DROP COLUMN modified_date

The UPDATE statement above is coded into some script somewhere, "liberty/01_migrate_widget_status_timestamp.py". The developer commits all this and everything works great.
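
In the hypothetical layout I sketched earlier, that script might look something like this (the run_data_migration() entry point is my own invention, since the blueprint doesn't define one; convert_string_to_datetime() stands in for whatever real conversion would be used, as in the SQL above):

# liberty/01_migrate_widget_status_timestamp.py -- hypothetical form
from sqlalchemy import text


def run_data_migration(connection):
    # written against a schema state in which widget_status has BOTH
    # modified_date and modified_timestamp columns
    connection.execute(text(
        "UPDATE widget_status "
        "SET modified_timestamp=convert_string_to_datetime(modified_date)"
    ))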

Developer #2 comes along two months later. She looks at the model and sees no mention of any column called "modified_date"; indeed, this column name is not in the source code of the application anywhere at all, except in that liberty/01_...py script which she isn't looking at. She makes her changes to the model, moving all the columns of widget_status into widget and removing the widget_status model. She also writes a data migration script to copy all the data. If we again look at the SQL mapped to online schema migration steps, they are:

"expand":

ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME

"migrate":

UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id)

"contract":

DROP TABLE widget_status

Let's say the "migrate" step above is in another script, "liberty/02_migrate_widget_status_to_widget.py".
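
In the same hypothetical form as before:

# liberty/02_migrate_widget_status_to_widget.py -- hypothetical form
from sqlalchemy import text


def run_data_migration(connection):
    # written against a schema state where widget_status still exists
    # and widget already has the new columns
    connection.execute(text(
        "UPDATE widget SET status_flag="
        "(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)"
    ))
    connection.execute(text(
        "UPDATE widget SET modified_timestamp="
        "(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id)"
    ))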

Some readers may see where this is going. Before we go there, let me also note that these schema changes certainly *do* require that the developer write a "migration". Online schema changes can save us work in very simple cases, but even for a basic series of operations like the above, we need a "migration". IMO, the need to "write a migration script" isn't the problem that the blueprint's "problem description" makes it out to be; but even if it is, the blueprint does not provide a solution for it in all but the most trivial cases. Also notice there's really no way to deal with the fact that we'd really like "widget.status_flag" to be NOT NULL; under a traditional script model, we'd add the column, populate it, then alter it to be NOT NULL; online schema migrations remove any place for this to happen, unless we consider it to belong to the "contract" phase. Looking at the current code I see nothing that attempts to deal with this, and the blueprint makes no mention of this thorny issue.
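
For reference, the usual three-step NOT NULL pattern in a traditional Alembic script would be something like this (a sketch; the point is that this sequence has no obvious home in the expand/migrate/contract model):

from alembic import op
import sqlalchemy as sa


def upgrade():
    # step 1: add the column as nullable, so existing rows remain legal
    op.add_column('widget', sa.Column('status_flag', sa.Integer(), nullable=True))

    # step 2: populate it from the old table
    op.execute(
        "UPDATE widget SET status_flag="
        "(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)"
    )

    # step 3: only now flip it to NOT NULL
    op.alter_column('widget', 'status_flag',
                    existing_type=sa.Integer(), nullable=False)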

So with our two model changes and our two data migration scripts, let's see what online schema changes do with them. All the "expand" steps are lumped into a single net migration, and necessarily all occur automatically, with no ability to intervene or change how they run. All the "contract" steps, same thing. Which means that the addition of the column "modified_timestamp" to "widget_status" never happens, because in the comparison of pre-expand to post-contract, the "widget_status" table is simply gone. Which in turn means that, because we've dropped this table, developer #2 now has to become aware of migration 01_migrate_widget_status_timestamp.py and change that script as well. Without changing it, this is what runs:

"expand":
ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL;
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME;

"migrate":
01 -> UPDATE widget_status SET modified_timestamp=convert_string_to_datetime(modified_date) # --> fails in all cases; the "modified_timestamp" column was never added to widget_status.

02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id) # --> fails on a pre-migrated database, no "modified_timestamp" column is present on widget_status

"contract":
DROP TABLE widget_status

I'm not sure by what mechanism the above failure would be discovered in development. But let's assume that they certainly are, as is the case now, and in CI we do a "Kilo"->"Liberty" run with data present and the failure of these scripts is discovered.

The developer of 02_migrate_widget_status_to_widget.py basically has to remove the 01_migrate_widget_status_timestamp.py script entirely and merge the work that it does into her migration. That is, instead of having two independent and isolated data migrations:

01 -> UPDATE widget_status SET modified_timestamp=convert_string_to_datetime(modified_date)

02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id)

we have to munge the two migrations together, because the state of the database that 01 was coded towards will no longer ever exist:

01 -> gone
02 -> moved to 01, and changed to:
01 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT convert_string_to_datetime(modified_date) FROM widget_status WHERE widget_id=widget.id)

The developer of 02 would not likely have a clear idea that this is how the migration has to be built, unless she carefully reads all migration scripts preceding hers that refer to the same tables or columns, or until she sees a full run from "Kilo" -> "Liberty" fail. It strongly suggests that looking at older versions of the code and reading through history will have to be part of a typical strategy for figuring out the correct steps. Using traditional migration steps, none of this complexity is needed; data migrations can be coded independently of each other, against an explicit and fixed database state that will always exist when that data migration runs.

Basically, if we start with a series of traditional schema migrations and associated data migrations, we could illustrate that as follows:

As -> Ad -> Bs -> Bd  -> Cs -> Cd -> Ds -> Dd

Where "Xs" is a schema migration and "Xd" is a data migration. Online schema changes basically remove all the "s" in between, leaving us with:

Ae -> De -> Ad -> Bd -> Cd -> Dd -> Dc -> Ac

Where "Xe" is an expand and "Xc" is a contract. Without a traditional versioning model, there is no more schema state for B or C; these states no longer exist, even though the data migrations, when they were written, were coded against these states. We've written data migrations against a schema that ceases to ever exist once a new migration is added, as below, our script that was written against the state "Ds" will also no longer have a schema at that state, once we add "E":

Ae -> Ee -> Ad -> Bd -> Cd -> Dd -> Ed -> Ec -> Ac # "De / Dc" has disappeared

Basically, online schema migrations mean that known database states are constantly being wiped out, and data migration scripts which were written against these states are constantly being broken. As the number of changes increases, the number of scripts potentially broken by losing the discrete state they were coded against and requiring manual re-coding increases.

The process of constantly folding over data migrations, written as discrete sequential steps but interpreted at runtime against an entirely different database schema than the one that was available during development, is for a modest series of changes somewhat tedious and error prone, but for a series of changes representing more serious schema refactorings it would quickly become unmanageable. The net result is that as the things we want to do to our schema leave the realm of the extremely simple and trivial, online schema migrations quickly end up *creating* more work than they eliminate, and this could lead to a hesitancy to take on more comprehensive schema migrations, thus making technical debt that much more difficult to clean up.

This is the very long form version of what I've been hypothesizing. If it is the case that Nova is in a place where more comprehensive, or even modest, schema refactorings are assumed never to be needed, and changes are always going to be small, simple, and involve little to no data migration, then we're fine. But without any space to write data migrations against known states of the schema, since those known states keep disappearing, we give up the ability to make incremental changes to a database where significant data migrations are needed. I think that's a very high price to pay.






