Of the many performance problems we're working on, the downtime deploy process is particularly important: as long as downtime deploys are slow, we cannot do them frequently; and as long as we cannot do them frequently, folk have to work off db-devel, leading to repeated merge conflicts, difficulty in delivering completed work, and long delays between completing something and being able to consider it done.
The latency exacerbates our downtime window: folk don't want to run background migrations if they can avoid it, because they have to wait for both the schema change *and* the migration to happen before they can land the follow-on code. This makes downtime deploys riskier and more likely to fail, which compounds the issue.

When we are applying schema changes we use what we call 'readonly' mode to provide some services while the schema change is being applied. Readonly mode is expensive, though. Currently the first 15-20 minutes of user-visible downtime are spent entirely on switching across to readonly mode. After the deploy, all looptuner scripts are blocked for 24 hours while we rebuild the readonly slave, and restoring services is time-consuming while we bring enough capacity online to handle our user load. We have to bring up new appservers because we're always deploying changes which are incompatible with the Python schema definitions we have.

Now, there are some things we can do to get into readonly mode faster: we can use stub's ini file to signal readonly mode as soon as an appserver reads the file - say once every 2 seconds or so. If we let existing requests complete, we'd be looking at ~22 seconds to get into readonly mode. This won't address the need to bring up new appservers, nor the overhead of rebuilding the readonly replica. Those overheads shouldn't be underestimated: even if we get entry into readonly mode down to 30 seconds, the readonly replica breakout is still serialised, and if we did a 60-second downtime once a week, a full 1/7th of the time we would be busy rebuilding replicas. We can't re-spin a failed schema change less than 24 hours after the previous attempt. The overhead of switching appserver instances is also non-trivial: it takes nearly an hour to do a nodowntime deploy, and similarly in readonly mode as we tear down the old readonly instances and bring up the new ones. Total downtime we're looking at is schema patch time + ~4-5 minutes.

There is another way to tackle the problem though: we can just switch off access to the database cluster, apply the schema changes, and re-enable access. Most schema changes are things we can do pretty easily without breaking compatibility with the Python code [1]. The ones that are harder we can probably still do with a little thought. In this model, a downtime deploy would be as follows:

- ~30 minutes before: shut down scripts that can't be interrupted
- @T=0: disable access to the database
- apply the schema change
- re-enable access to the db

No appserver bouncing, nothing. We could show an error page on the appservers during this time - we can iterate to make that pretty - and we could use the aforementioned ini file to tell the appservers that a schema change is going on. Total downtime: schema-patch time + ~60 seconds. (There's a rough sketch of this sequence a little further down.)

In terms of development process, we would land schema changes that are compatible with the Python code on devel - just the schema change, no Python code changes at all. Then we'd do nodowntime deploys as normal up to and past that revision; when a good time to do the downtime arrives (e.g. we might set a fixed time of the day or week), we'd do the downtime deploy described above. After that's live, developers would land the code that uses the new schema / populates new columns etc.
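To make the mechanics of that disable / patch / re-enable sequence concrete, here's a very rough sketch of what a driver script could look like. It assumes PostgreSQL and psycopg2, and everything specific in it - the flag-file path and keys, the role and database names, the psql patch invocation - is invented for illustration; stub's actual ini mechanism and our real patch tooling will differ.

#!/usr/bin/env python
# Rough sketch only: paths, role/database names and the patch invocation
# below are placeholders, not our actual production tooling.
import subprocess

import psycopg2

ADMIN_DSN = "dbname=launchpad_prod user=postgres"  # placeholder DSN
APP_ROLE = "launchpad_main"                        # placeholder app role
FLAG_FILE = "/etc/launchpad/maintenance.ini"       # placeholder flag file


def set_maintenance(flag):
    # Appservers re-read this file every couple of seconds; while the flag
    # is set they stop touching the DB and serve a 'schema change in
    # progress' page instead.
    with open(FLAG_FILE, "w") as f:
        f.write("[maintenance]\ndb_offline = %s\n"
                % ("true" if flag else "false"))


def main():
    set_maintenance(True)

    con = psycopg2.connect(ADMIN_DSN)
    con.autocommit = True
    cur = con.cursor()

    # @T=0: disable access to the database and kick existing sessions.
    # (In reality PUBLIC's default CONNECT privilege needs revoking too.)
    cur.execute("REVOKE CONNECT ON DATABASE launchpad_prod FROM %s" % APP_ROLE)
    cur.execute(
        "SELECT pg_terminate_backend(procpid) FROM pg_stat_activity"
        " WHERE usename = %s", (APP_ROLE,))  # 'pid' on PostgreSQL >= 9.2

    # Apply the schema change (placeholder for the real patch runner).
    subprocess.check_call(
        ["psql", "-d", "launchpad_prod",
         "-f", "database/schema/patch-NNNN.sql"])

    # Re-enable access; appservers simply reconnect on their next request.
    cur.execute("GRANT CONNECT ON DATABASE launchpad_prod TO %s" % APP_ROLE)
    set_maintenance(False)


if __name__ == "__main__":
    main()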
If we did have something that required new appservers to be deployed, we'd:

- do a nodowntime deploy up to the revision with the schema change in it
- decide when we would do the downtime and prevent all deploys until that time
- then do a hybrid deploy:
  - ~30 minutes before: shut down scripts that can't be interrupted
  - stage the new code
  - @T=0: disable access to the database
  - in parallel (massively parallel - do all appservers and the schema all at once):
    - apply the schema change
    - kill appservers with -9 and restart them with the new code base
  - re-enable access to the db

(A rough sketch of the parallel step is at the end of this mail.) But I think we'd want to aggressively avoid such scenarios, as they are harder and more complex to execute on, as well as having more downtime. (Total downtime would be schema time + 4-5 minutes, as with readonly mode.)

The ability to do the schema change early means we can stop bundling the new Python code with the schema change without making folk carry the work in an unmerged branch for extended periods of time (which readonly mode requires).

Have I missed some way we can mitigate the costs of readonly mode, or something we'd need to have in place if we ditched it? If I haven't, I propose that we:

- consult with our users and see if they have concerns or ideas we haven't considered
- get stakeholder buy-in (I'm going to forward this to the stakeholders now to start discussion)
- identify the core facilities we need to move to this process
- stop bundling schema & Python changes immediately. We could, if we want, start landing schema changes live on devel next week.

-Rob

[1]: We don't split the schema change and Python definitions today because there is little benefit to doing so. But if doing so gets us shorter cycle times, then we will get benefits.
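P.S. For concreteness, here's a very rough sketch of how the 'massively parallel' step of the hybrid deploy could be driven. The host list, the restart command and the patch file are made up for illustration; the point is simply that the schema apply and every appserver bounce are kicked off at the same time.

#!/usr/bin/env python
# Sketch of the 'massively parallel' hybrid step: hostnames, the restart
# command and the patch file are invented placeholders.
import subprocess
import threading

APPSERVERS = ["appserver1", "appserver2", "appserver3"]  # hypothetical hosts


def apply_schema_change():
    # Placeholder for the real patch runner, as in the earlier sketch.
    subprocess.check_call(
        ["psql", "-d", "launchpad_prod",
         "-f", "database/schema/patch-NNNN.sql"])


def bounce_appserver(host):
    # kill -9 the old appserver and start it from the staged new tree.
    subprocess.check_call(
        ["ssh", host,
         "pkill -9 -f launchpad-appserver; /srv/new-tree/bin/start-appserver"])


def main():
    # Access to the db has already been disabled at @T=0; now run the
    # schema change and every appserver bounce concurrently.
    jobs = [threading.Thread(target=apply_schema_change)]
    jobs += [threading.Thread(target=bounce_appserver, args=(host,))
             for host in APPSERVERS]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    # Once everything has finished, re-enable access to the db as before.


if __name__ == "__main__":
    main()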