Of the many performance problems we're working on, the downtime deploy process is particularly important: as long as downtime deploys are slow, we cannot do them frequently; and as long as we cannot do them frequently, folk have to work off db-devel, leading to repeated merge conflicts, difficulty in delivering completed work, and long delays between completing something and being able to consider it done.
The latency exacerbates our downtime window: folk don't want to run background migrations if they can avoid it, because they have to wait for both the schema change *and* the migration to happen before they can land the follow-on code. This makes downtime deploys riskier and more likely to fail, which compounds the issue.

When we are applying schema changes we use what we call 'readonly' mode to provide some services while the schema change is being applied. Readonly mode is expensive, though. Currently the first 15-20 minutes of user-visible downtime are spent entirely on switching across to readonly mode. After the deploy, all looptuner scripts are blocked for 24 hours while we rebuild the readonly slave, and restoring services is time-consuming while we bring enough capacity online to handle our user load. We have to bring up new appservers because we're always deploying changes which are incompatible with the Python schema definitions we have.

Now, there are some things we can do to get into readonly mode faster: we can use stub's ini file to signal readonly mode as soon as an appserver reads the file - say once every 2 seconds or so. If we let existing requests complete, we'd be looking at ~22 seconds to get into readonly mode. This won't address the need to bring up new appservers, nor the overhead of rebuilding the readonly replica. Those overheads shouldn't be underestimated: even if we get entry into readonly mode down to 30 seconds, the readonly replica breakout is still serialised, and if we did a 60-second downtime once a week, a full 1/7th of the time we would be busy rebuilding replicas. We can't re-spin a failed schema change less than 24 hours after the previous attempt. The overhead of switching appserver instances is also non-trivial: it takes nearly an hour to do a nodowntime deploy, and similarly in readonly mode as we tear down the old readonly instances and bring up the new ones. Total downtime we're looking at is schema patch time + ~4-5 minutes.

There is another way to tackle the problem though: we can just switch off access to the database cluster, apply the schema changes, and re-enable access. Most schema changes are things we can do pretty easily without breaking compatibility with the Python code [1]. The ones that are harder we can probably still do with a little thought. In this model, a downtime deploy would be as follows:

- ~30 minutes before: shut down scripts that can't be interrupted
- @T=0: disable access to the database
- apply the schema change
- re-enable access to the db

No appserver bouncing, nothing. We could show an error page on the appservers during this time - we can iterate to make that pretty - and we could use the aforementioned ini file to tell the appservers that a schema change is going on. Total downtime: schema-patch time + ~60 seconds. (There's a rough sketch of this sequence a little further down.)

In terms of development process, we would land schema changes that are compatible with the Python code on devel - just the schema change, no Python code changes at all. Then we'd do nodowntime deploys as normal up to and past that revision; when a good time to do the downtime arrives (e.g. we might set a fixed time of the day or week), we'd do the downtime deploy described above. After that's live, developers would land the code that uses the new schema / populates new columns etc.
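To make the mechanics of that disable / patch / re-enable sequence concrete, here's a very rough sketch of what a driver script could look like. It assumes PostgreSQL and psycopg2, and everything specific in it - the flag-file path and keys, the role and database names, the psql patch invocation - is invented for illustration; stub's actual ini mechanism and our real patch tooling will differ.

#!/usr/bin/env python
# Rough sketch only: paths, role/database names and the patch invocation
# below are placeholders, not our actual production tooling.
import subprocess

import psycopg2

ADMIN_DSN = "dbname=launchpad_prod user=postgres"  # placeholder DSN
APP_ROLE = "launchpad_main"                        # placeholder app role
FLAG_FILE = "/etc/launchpad/maintenance.ini"       # placeholder flag file


def set_maintenance(flag):
    # Appservers re-read this file every couple of seconds; while the flag
    # is set they stop touching the DB and serve a 'schema change in
    # progress' page instead.
    with open(FLAG_FILE, "w") as f:
        f.write("[maintenance]\ndb_offline = %s\n"
                % ("true" if flag else "false"))


def main():
    set_maintenance(True)

    con = psycopg2.connect(ADMIN_DSN)
    con.autocommit = True
    cur = con.cursor()

    # @T=0: disable access to the database and kick existing sessions.
    # (In reality PUBLIC's default CONNECT privilege needs revoking too.)
    cur.execute("REVOKE CONNECT ON DATABASE launchpad_prod FROM %s" % APP_ROLE)
    cur.execute(
        "SELECT pg_terminate_backend(procpid) FROM pg_stat_activity"
        " WHERE usename = %s", (APP_ROLE,))  # 'pid' on PostgreSQL >= 9.2

    # Apply the schema change (placeholder for the real patch runner).
    subprocess.check_call(
        ["psql", "-d", "launchpad_prod",
         "-f", "database/schema/patch-NNNN.sql"])

    # Re-enable access; appservers simply reconnect on their next request.
    cur.execute("GRANT CONNECT ON DATABASE launchpad_prod TO %s" % APP_ROLE)
    set_maintenance(False)


if __name__ == "__main__":
    main()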
If we did have something that required new appservers to be deployed, we'd:

- do a nodowntime deploy up to the revision with the schema change in it
- decide when we would do the downtime and prevent all deploys until that time
- then do a hybrid deploy:
  - ~30 minutes before: shut down scripts that can't be interrupted
  - stage the new code
  - @T=0: disable access to the database
  - in parallel (massively parallel - do all appservers and the schema all at once):
    - apply the schema change
    - kill appservers with -9 and restart them with the new code base
  - re-enable access to the db

(A rough sketch of the parallel step is at the end of this mail.) But I think we'd want to aggressively avoid such scenarios, as they are harder and more complex to execute on, as well as having more downtime. (Total downtime would be schema time + 4-5 minutes, as with readonly mode.)

The ability to do the schema change early means we can stop bundling the new Python code with the schema change without making folk carry the work in an unmerged branch for extended periods of time (which readonly mode requires).

Have I missed some way we can mitigate the costs of readonly mode, or something we'd need to have in place if we ditched it? If I haven't, I propose that we:

- consult with our users and see if they have concerns or ideas we haven't considered
- get stakeholder buy-in (I'm going to forward this to the stakeholders now to start discussion)
- identify the core facilities we need to move to this process
- stop bundling schema & Python changes immediately. We could, if we want, start landing schema changes live on devel next week.

-Rob

[1]: We don't split the schema change and Python definitions today because there is little benefit to doing so. But if doing so gets us shorter cycle times, then we will get benefits.
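P.S. For concreteness, here's a very rough sketch of how the 'massively parallel' step of the hybrid deploy could be driven. The host list, the restart command and the patch file are made up for illustration; the point is simply that the schema apply and every appserver bounce are kicked off at the same time.

#!/usr/bin/env python
# Sketch of the 'massively parallel' hybrid step: hostnames, the restart
# command and the patch file are invented placeholders.
import subprocess
import threading

APPSERVERS = ["appserver1", "appserver2", "appserver3"]  # hypothetical hosts


def apply_schema_change():
    # Placeholder for the real patch runner, as in the earlier sketch.
    subprocess.check_call(
        ["psql", "-d", "launchpad_prod",
         "-f", "database/schema/patch-NNNN.sql"])


def bounce_appserver(host):
    # kill -9 the old appserver and start it from the staged new tree.
    subprocess.check_call(
        ["ssh", host,
         "pkill -9 -f launchpad-appserver; /srv/new-tree/bin/start-appserver"])


def main():
    # Access to the db has already been disabled at @T=0; now run the
    # schema change and every appserver bounce concurrently.
    jobs = [threading.Thread(target=apply_schema_change)]
    jobs += [threading.Thread(target=bounce_appserver, args=(host,))
             for host in APPSERVERS]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    # Once everything has finished, re-enable access to the db as before.


if __name__ == "__main__":
    main()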