On 5/14/15 11:40 AM, Mike Bayer wrote:
The "online schema changes" patch has been abandoned. I regret that I was not able to review the full nature of this spec in time to note some concerns I have, namely that Alembic does not plan on ever acheiving 100% "automation" of migration generation; such a thing is not possible and would require a vast amount of development resources in any case to constantly keep up with the ever changing features and behaviors of all target databases. The online migration spec, AFAICT, does not offer any place for manual migrations to be added, and I think this will be a major problem. The decisions made by "autogenerate" I have always stated should always be manually reviewed and corrected, so I'd be very nervous about a system that uses autogenerate on the fly and sends those changes directly to a production database without any review.

So it looks like Nova has decided at the summit to forge ahead with online schema migrations. Travel issues prevented me from being present at the summit, and therefore at the session where this was discussed. But had I been there, a short 40-minute session wouldn't have been a venue in which I could have organized my thoughts well enough to be any more effective in discussing this feature, so it's probably better that I wasn't there. As I've mentioned before, the timing of the blueprint on this feature was just not well synchronized for me: it was proposed during my first month working for Red Hat and OpenStack, when I hardly knew what things were, and as you can see from my comments at https://review.openstack.org/#/c/102545/9, I was primarily alarmed at the notion that this system was going to be built entirely on SQLAlchemy-Migrate internals, a project which it was one of my primary tasks at my new job to replace with Alembic. I hardly understood what the actual proposal was, as I was still learning how to install OpenStack at that point, so I really missed being able to dig deeply into it. The spec went quiet after a few weeks and I mostly forgot about it, until November when it suddenly awoke, burned ahead through Christmas, and was approved on Jan 6. Again, terrible timing for me, as my wife gave birth to our son in late October and I was pretty much 24/7 dealing with a newborn, not to mention getting through the holidays. So I missed the boat on the blueprint entirely.

For now, I have to assume that Nova will go ahead with this. But let me at least take some effort to explain more fully what I think the problem with this approach is. I don't think this problem will necessarily be that big a deal for Nova, at least most of the time; but when it is a problem, it might be pretty bad. My concern is that the system has no way at all to provide for manual migration steps, or any control at all over how schema migrations proceed; and critically, that it makes no provision for the very common case of schema migrations that also need data to be moved and manipulated. The blueprint makes no mention whatsoever of how data migrations will be handled, not even within sections such as "Developer impact" or "Testing". Right now, data migrations are just part of the sqlalchemy-migrate scripts or Alembic scripts. But with this change, we no longer write such scripts, nor do we even have a place to put them if we wanted to; data migrations are no longer integrated within this system and have to be dealt with externally.
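
To make concrete what we'd be giving up, here is roughly what an integrated schema + data migration looks like today as a single Alembic script (a minimal sketch only; the revision identifiers are made up, and the "widget_status" table is from the hypothetical example I'll walk through below):

from alembic import op
import sqlalchemy as sa

# revision identifiers, made up for this sketch
revision = 'a1b2c3d4e5f6'
down_revision = 'f6e5d4c3b2a1'


def upgrade():
    # schema change: add the new column
    op.add_column('widget_status', sa.Column('modified_timestamp', sa.DateTime()))

    # data migration, in the same script, written against a schema state
    # that is guaranteed to exist right here
    op.execute(
        "UPDATE widget_status "
        "SET modified_timestamp=convert_string_to_datetime(modified_date)"
    )

    # schema change: drop the old column
    op.drop_column('widget_status', 'modified_date')

The point being that the DDL and the data manipulation interleave freely, against schema states the script author can rely upon.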

It may be the case that Nova has a schema that is no longer in need of major changes, that we are only talking about adding new columns and new tables to support new features and removing some old cruft, and that moving data around is just not going to be needed. But once you build a system that makes data migrations second-class or even non-citizens, you close the doors on how much you can do with your schema. Big changes down the road are basically no longer possible without the ability to also migrate data as the DDL is emitted.

So OK, of course we can still do data migrations. The spec doesn't need to say anything; it should be obvious that they need to be performed during the "migrate" phase, in between "expand" and "contract", when you have both the new tables/columns available as a destination for data and the old tables/columns still present as the source. As far as what form they take, we no longer have migration scripts or versions within a major release, so we have to assume it will be just a big series of scripts somewhere, tagged to the major release like "Kilo" or "Liberty", just a bunch of database code that runs while we're in "migrate".
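
Just so we have something concrete to picture for the rest of this message, I'm imagining a layout along these lines (entirely my own invention; the blueprint specifies nothing of the sort):

nova/db/data_migrations/
    kilo/
        01_some_data_migration.py
        02_another_data_migration.py
    liberty/
        01_some_data_migration.py

with some runner that executes everything under the current release's directory, in order, during the "migrate" phase.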

I have no doubt that's what Nova will do, and usually it will be fine. But to illustrate, here's the kind of scenario where it goes wrong, in a way that is at best pretty annoying and at worst a serious and error-prone development and testing headache. Let's start with a hypothetical schema that has some design issues: two tables, "widget" and "widget_status", where "widget_status" has some kind of information about a "widget", and also stores a timestamp, unfortunately as a string:

CREATE TABLE widget (
    id INTEGER PRIMARY KEY,
    name VARCHAR(30) NOT NULL
)

CREATE TABLE widget_status (
    widget_id INTEGER PRIMARY KEY REFERENCES widget(id),
    status_flag INTEGER NOT NULL,
    modified_date VARCHAR(30)
)
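
Since the online approach drives everything from the models rather than from migration scripts, the corresponding SQLAlchemy models would be roughly as follows (a sketch using plain declarative classes, not Nova's actual model base classes):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Widget(Base):
    __tablename__ = 'widget'

    id = Column(Integer, primary_key=True)
    name = Column(String(30), nullable=False)


class WidgetStatus(Base):
    __tablename__ = 'widget_status'

    widget_id = Column(Integer, ForeignKey('widget.id'), primary_key=True)
    status_flag = Column(Integer, nullable=False)
    # the timestamp, unfortunately stored as a string
    modified_date = Column(String(30))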


Let's say that two entirely different changes by two different developers want to accomplish two things: 1. convert "modified_date" into a new column "modified_timestamp" which is a DATETIME type, not a string and 2. merge these two tables into one, as the need to JOIN all the time is non-performant and unnecessary. That is, we'll end up with this:


CREATE TABLE widget (
    id INTEGER PRIMARY KEY,
    name VARCHAR(30) NOT NULL,
    status_flag INTEGER NOT NULL,
    modified_timestamp DATETIME
)

Right off, let's keep in mind that when online schema migrations run, the fact that there's a #1 and a #2 change to the schema is lost. Even though migration #1 will add "modified_timestamp" to "widget_status", when we run the sum of both #1 and #2, that interim state of the schema will never exist; no changes are made to widget_status except the final DROP, since these changes aren't visible by just looking at the two endpoints of the schema, which, unless I'm totally misunderstanding, is how online schema changes work.
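
That is, if the comparison is ultimately doing something along the lines of Alembic's own autogenerate diff (compare_metadata() is a real Alembic API; whether the online system uses it in exactly this way is my assumption):

from alembic.autogenerate import compare_metadata
from alembic.migration import MigrationContext
from sqlalchemy import create_engine

# hypothetical database URL and model import, just for illustration;
# "Base" is the declarative base from the model sketch above
from myapp.models import Base
engine = create_engine("postgresql://nova@localhost/nova")

with engine.connect() as conn:
    context = MigrationContext.configure(conn)
    # diff the live (pre-expand) database against the models
    # (post-contract state); only these two endpoints are visible
    diffs = compare_metadata(context, Base.metadata)

# "diffs" would contain a ('remove_table', ...) entry for widget_status and
# ('add_column', ...) entries for widget; nothing in it refers to the
# intermediate state where widget_status briefly gains modified_timestamp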

Developer #1 changes the model such that there's a new column on "widget_status" called "modified_timestamp", which stores the value as a datetime; this is a new column that replaces the modified_date column, and because we need to do a data migration, both columns need to exist simultaneously while the string-based dates are UPDATEd into the timestamp column. The developer writes a data migration script that will transfer this data while the table has both columns. If we look at the SQL mapped to online schema migration steps, they are:

"expand":  ALTER TABLE widget_status ADD COLUMN modified_timestamp DATETIME;
"migrate": UPDATE widget_status SET modified_timestamp=convert_string_to_datetime(modified_date)
"contract": ALTER TABLE widget_status DROP COLUMN modified_date

The UPDATE statement above is coded into some script somewhere, "liberty/01_migrate_widget_status_timestamp.py". The developer commits all this and everything works great.
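
In the hypothetical layout I sketched earlier, that script might look something like this (the run_data_migration() entry point is my own invention, since the blueprint doesn't define one; convert_string_to_datetime() stands in for whatever real conversion would be used, as in the SQL above):

# liberty/01_migrate_widget_status_timestamp.py -- hypothetical form
from sqlalchemy import text


def run_data_migration(connection):
    # written against a schema state in which widget_status has BOTH
    # modified_date and modified_timestamp columns
    connection.execute(text(
        "UPDATE widget_status "
        "SET modified_timestamp=convert_string_to_datetime(modified_date)"
    ))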

Developer #2 comes along two months later. She looks at the model and sees no mention of any column called "modified_date"; indeed, this column name is not in the source code of the application anywhere at all, except in that liberty/01_...py script which she isn't looking at. She makes her changes to the model, moving all the columns of widget_status into widget and removing the widget_status model. She also writes a data migration script to copy all the data. If we again look at the SQL mapped to online schema migration steps, they are:

"expand":

ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME

"migrate":

UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id)

"contract":

DROP TABLE widget_status

Let's say the "migrate" step above is in another script, "liberty/02_migrate_widget_status_to_widget.py".
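
In the same hypothetical form as before:

# liberty/02_migrate_widget_status_to_widget.py -- hypothetical form
from sqlalchemy import text


def run_data_migration(connection):
    # written against a schema state where widget_status still exists
    # and widget already has the new columns
    connection.execute(text(
        "UPDATE widget SET status_flag="
        "(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)"
    ))
    connection.execute(text(
        "UPDATE widget SET modified_timestamp="
        "(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id)"
    ))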

Some readers may see where this is going. Before we go there, let me also note that these schema changes certainly *do* require that the developer write a "migration". Online schema changes can save us work in very simple cases, but even for a basic series of operations like the above, we need a "migration". IMO, the need to "write a migration script" isn't the problem that the blueprint's "problem description" makes it out to be; but even if it is, the blueprint does not provide a solution for it in all but the most trivial cases. Also notice there's really no way to deal with the fact that we'd really like "widget.status_flag" to be NOT NULL; under a traditional script model, we'd add the column, populate it, then alter it to be NOT NULL; online schema migrations remove any place for this to happen, unless we consider it to belong to the "contract" phase. Looking at the current code I see nothing that attempts to deal with this, and the blueprint makes no mention of this thorny issue.
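
For reference, the usual three-step NOT NULL pattern in a traditional Alembic script would be something like this (a sketch; the point is that this sequence has no obvious home in the expand/migrate/contract model):

from alembic import op
import sqlalchemy as sa


def upgrade():
    # step 1: add the column as nullable, so existing rows remain legal
    op.add_column('widget', sa.Column('status_flag', sa.Integer(), nullable=True))

    # step 2: populate it from the old table
    op.execute(
        "UPDATE widget SET status_flag="
        "(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)"
    )

    # step 3: only now flip it to NOT NULL
    op.alter_column('widget', 'status_flag',
                    existing_type=sa.Integer(), nullable=False)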

So with our two model changes and our two data migration scripts, let's see what online schema changes do with them. All the "expand" steps are lumped into a single net migration, and necessarily all occur automatically, with no ability to intervene or change how they run. All the "contract" steps, same thing. Which means that the addition of the column "modified_timestamp" to "widget_status" never happens, because in the comparison of pre-expand to post-contract, the "widget_status" table is simply gone. Which in turn means that, because we've dropped this table, developer #2 now has to become aware of migration 01_migrate_widget_status_timestamp.py and change that script as well. Without changing it, this is what runs:

"expand":
ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL;
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME;

"migrate":
01 -> UPDATE widget_status SET modified_timestamp=convert_string_to_datetime(modified_date) # --> fails in all cases; the "modified_timestamp" column was never added to widget_status.

02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id) # --> fails on a pre-migrated database, no "modified_timestamp" column is present on widget_status

"contract":
DROP TABLE widget_status

I'm not sure by what mechanism the above failure would be discovered in development. But let's assume that they certainly are, as is the case now, and in CI we do a "Kilo"->"Liberty" run with data present and the failure of these scripts is discovered.

The developer of 02_migrate_widget_status_to_widget.py basically has to remove the 01_migrate_widget_status_timestamp.py script entirely and merge the work that it does into her migration. That is, instead of having two independent and isolated data migrations:

01 -> UPDATE widget_status SET modified_timestamp=convert_string_to_datetime(modified_date)

02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM widget_status WHERE widget_id=widget.id)

we have to munge the two migrations together, because the state of the database that 01 was coded towards will no longer ever exist:

01 -> gone
02 -> moved to 01, and changed to:
01 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT convert_string_to_datetime(modified_date) FROM widget_status WHERE widget_id=widget.id)

The developer of 02 would not likely have a clear idea that this is how the migration has to be built, unless she carefully reads all migration scripts preceding hers that refer to the same tables or columns, or until she sees a full run from "Kilo" -> "Liberty" fail. It strongly suggests that looking at older versions of the code and reading through history will have to be part of a typical strategy for figuring out the correct steps. Using traditional migration steps, none of this complexity is needed; data migrations can be coded independently of each other, against an explicit and fixed database state that will always exist when that data migration runs.

Basically, if we start with a series of traditional schema migrations and associated data migrations, we could illustrate that as follows:

As -> Ad -> Bs -> Bd  -> Cs -> Cd -> Ds -> Dd

Where "Xs" is a schema migration and "Xd" is a data migration. Online schema changes basically remove all the "s" in between, leaving us with:

Ae -> De -> Ad -> Bd -> Cd -> Dd -> Dc -> Ac

Where "Xe" is an expand and "Xc" is a contract. Without a traditional versioning model, there is no more schema state for B or C; these states no longer exist, even though the data migrations, when they were written, were coded against these states. We've written data migrations against a schema that ceases to ever exist once a new migration is added, as below, our script that was written against the state "Ds" will also no longer have a schema at that state, once we add "E":

Ae -> Ee -> Ad -> Bd -> Cd -> Dd -> Ed -> Ec -> Ac # "De / Dc" has disappeared

Basically, online schema migrations mean that known database states are constantly being wiped out, and data migration scripts which were written against these states are constantly being broken. As the number of changes increases, the number of scripts potentially broken by losing the discrete state they were coded against and requiring manual re-coding increases.

The process of constantly folding over data migrations, written as discrete sequential steps but interpreted at runtime against an entirely different database schema than the one that was available during development, is for a modest series of changes somewhat tedious and error prone, but for a series of changes representing more serious schema refactorings it would quickly become unmanageable. The net result is that as the things we want to do to our schema leave the realm of the extremely simple and trivial, online schema migrations quickly end up *creating* more work than they eliminate, and this could lead to a hesitancy to take on more comprehensive schema migrations, thus making technical debt that much more difficult to clean up.

This is the very long form version of what I've been hypothesizing. If it is the case that Nova is in a place where more comprehensive, or even modest, schema refactorings are assumed never to be needed, and changes are always going to be small, simple, and involve little to no data migration, then we're fine. But without any space to write data migrations against known states of the schema, since those known states keep disappearing, we give up the ability to make incremental changes to a database where significant data migrations are needed. I think that's a very high price to pay.






