I try to write about application performance in these emails, but today I'm going to exercise some editorial leeway and instead write about development performance - specifically, the latency with which schema changes get deployed to the production environment.
This mail is a little long, and there is a very important question in it, so the tl;dr version first:

 - If you know of a script we have which, if its database connection is interrupted, requires manual fixup afterwards, please let Stuart or me know.
 - For *all* new DB patches please aim for a 10-15 second *total* application time on staging/qastaging. For assistance achieving that you're welcome to tap Stuart or me. As of this week, slow patches (> 15 seconds) will require signoff by me or Francis, as they will cause excessive downtime.
 - *No* more changes to both DB code and Python code at the same time: this applies to both devel and db-devel.
 - There will be some disruption to staging.launchpad.net as this project is worked on and tested.

Now, for the actual content :)...

It's likely that everyone who cares about this already knows our current process, but I'm going to summarise it now anyway for clarity. The process today is:

 - we decide to make a change (e.g. to improve performance, add a feature, whatever)
 - a patch is prepared (involving one or more devs, the DBA and/or TA)
 - the patch is reviewed and categorised as apply-live or apply-cold
 - apply-live patches (https://dev.launchpad.net/Database/LivePatching) land on lp:launchpad/devel
   - these are then applied live by a sysadmin or DBA, before the matching code change can be deployed
 - apply-cold patches land on lp:launchpad/db-devel
   - these then stall for 2 weeks on average, until the next monthly downtime

Monthly downtime involves:

 - shutting everything down / into read-only mode
 - deploying new code to all the servers
 - breaking out the read-only replica
 - applying any pending db patches
 - starting up all the servers
 - zapping and rebuilding the read-only replica

We have about an hour of overhead in shutting down and starting up the appservers, plus the actual db patch application itself. In addition there is an hour before the downtime where we quiesce background tasks like email sending, archive publishing and so on.

As a whole this process sits in an awkward position optimisation-wise: because patch application occurs infrequently, lots of patches queue up; because lots of patches are queued, the non-application overhead is large but still in the same order of magnitude as the application time itself; and because the non-application overhead is large, we cannot do the downtime more frequently. It's a self-reinforcing situation. Making it worse, a lot of the overhead we have is related to making the long (up to 90 minutes) downtime tolerable, and is itself a frequent cause of delays in the application process (for instance, breaking out the read-only replica often crashes the replication software).

The aggregate impact of 2-week delays on work landing is pretty significant, and so fixing this is a high priority for addressing our development cycle time. I handwaved a leaner approach a few weeks ago: discard as much overhead as possible to make the downtime as close to the actual application time as possible, and do patches one at a time to minimise batching effects. This is now covered in https://dev.launchpad.net/LEP/FastDowntime. Stuart has been busy translating my handwave into concrete possibilities. This is deliberately -lean- - no frills, no bells, and potentially very ugly. I expect that we can iterate rapidly once the basic facility is in place.

Now, the new process is:

 - we decide to make a change (e.g. to improve performance, add a feature, whatever)
 - a patch is prepared (involving one or more devs, the DBA and/or TA)
 - the patch is reviewed and categorised as apply-live or apply-cold
 - apply-live patches (https://dev.launchpad.net/Database/LivePatching) land on lp:launchpad/devel
   - these are then applied live by a sysadmin or DBA, before the matching code change can be deployed
 - apply-cold patches land on lp:launchpad/devel
   - these are then applied with a fast downtime process

The fast downtime itself involves (there's a rough sketch of the middle steps below):

 - ~1 hour beforehand, quiescing background tasks, and then at the appointed time:
 - preventing new connections to the db except from the patcher
 - checking that no white-listed DB users are in the middle of transactions (and if there are, aborting)
 - kicking all connections off the db servers
 - applying the patches
 - allowing new connections back in
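To make the middle of that list concrete, here's a very rough Python/psycopg2 sketch of what the patcher could do - purely illustrative, not the actual tooling Stuart is building. The whitelist contents, patch filename and connection string are all invented for the example, and the surrounding steps (blocking new connections beforehand, letting them back in afterwards) are assumed to happen elsewhere:

    # Illustrative sketch only; the user names, patch file and connection
    # details below are made up. Assumes psycopg2 and a superuser connection.
    # On PostgreSQL releases before 9.2, pg_stat_activity exposes procpid
    # rather than pid.
    import sys
    import psycopg2

    WHITELIST = ['process_accepted']                 # hypothetical script users
    PATCH_SQL = open('patch-2208-99-0.sql').read()   # hypothetical patch file

    con = psycopg2.connect('dbname=launchpad_prod user=postgres')
    con.autocommit = True
    cur = con.cursor()

    # Abort if any whitelisted user is mid-transaction: killing them would
    # mean manual fixup afterwards, so we bail out and retry later instead.
    cur.execute("""
        SELECT usename FROM pg_stat_activity
        WHERE usename = ANY(%s) AND xact_start IS NOT NULL
        """, (WHITELIST,))
    busy = [row[0] for row in cur.fetchall()]
    if busy:
        sys.exit('Whitelisted users still busy: %s' % ', '.join(busy))

    # Kick every other connection off so the patch gets its locks immediately.
    cur.execute("""
        SELECT pg_terminate_backend(pid) FROM pg_stat_activity
        WHERE pid <> pg_backend_pid()
        """)

    # Apply the patch. This is the part the 10-15 second budget covers, which
    # is why slow patches now need signoff.
    cur.execute(PATCH_SQL)

The point of the sketch is just that the core of the downtime can be that small; everything else we currently do around it is overhead we're trying to shed.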
Now, if there are scripts that will need manual repair/recovery if they are kicked out of the DB, we need to know - so we can whitelist their users. Normally they should be quiesced, but if something goes wrong with that, this is a fallback step. (To whitelist them, just reply to this thread with the script user.)

It will take a bit of time to get all the pieces together and working, but I'm reasonably sure that we'll have it ready to roll before the next monthly downtime window would have come around; so we won't be doing that window. Instead, we'll pull the patches that have accrued on db-devel into devel one at a time using this new process. Once things are mature enough we can disable db-devel altogether, but we're not there *yet*.

I'll be following up this mail with a somewhat less detailed one for -users, and possibly a blog post.

Cheers,
Rob