Hi. Earlier today we did a successful test run of the fastdowntime database deployment process.
At 2011-09-08 08:43 we shut down the soyuz systems. At 2011-09-08 08:53:05 we entered 'downtime' and made a no-op update, running just trusted.sql and resetting security. At 2011-09-08 08:55:17 the outage completed: 2 minutes, 12 seconds.

All live systems failed during the outage as expected, generating an OOPS storm. They all recovered successfully after the outage.

The trial tells us that the process seems solid enough, and that our overhead for applying database patches is about 2 minutes and 12 seconds. The most significant optimization we can make to reduce this overhead is to switch to Slony-I 2.0 (we are currently running 1.2, but U1 is already running 2.0 happily).

We are scheduled for a real run at 2011-09-09 08:30 UTC, applying our backlog of database patches. The outage should be well within our 5 minute window. It is not yet known which parts of soyuz will be kept live during the update, and which fragile parts will be shut down for a longer period.

I think these are the main issues raised:

* The 'long running transaction' threshold was too low. Bug #844616.

* There are rogue archive-publisher connections, likely from command line tools being run interactively. These need to connect as different database users so we can identify and deal with them appropriately, rather than just refusing to kick off the process because archive-publisher has been deemed a 'fragile user'. (A rough sketch of the sort of pre-flight check involved is at the end of this mail.)

* People hate being presented with an OOPS screen on the main appserver during the outage. Bug #844631.

I expect the next issue to be raised will be how to cope with the OOPS storm in our reports. I think we need to inform the reports about outages and get them to ignore OOPSes during these windows, and generate a special report for the outage windows to ensure that systems are failing the way they should be failing. (A sketch of that filtering is also at the end of this mail.)

Thanks to everyone who helped get us this far, particularly all the patch and buildbot wranglers who untangled the web of rollbacks and rollbacks of rollbacks and helped get qa'd revisions of code onto servers so we could finally do this.

-- 
Stuart Bishop <stu...@stuartbishop.net>
http://www.stuartbishop.net/
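
For the curious, here is a rough sketch of the kind of pre-flight check the fastdowntime process needs to make before starting the outage: bail out if any transaction has been open longer than a threshold, or if a 'fragile' user such as archive-publisher still has live connections. The connection string, user list and threshold below are placeholders for illustration; the real script differs in detail.

    # Illustrative pre-flight check only, not the real fastdowntime script.
    # Assumptions: psycopg2 is available; the DSN, user list and threshold
    # are placeholders.
    import sys
    import psycopg2

    FRAGILE_USERS = ['archive-publisher']  # users whose connections block the outage
    MAX_TRANSACTION_AGE = '2 minutes'      # cf. Bug #844616 - threshold needs tuning

    def preflight_ok(dsn):
        con = psycopg2.connect(dsn)
        cur = con.cursor()
        # Transactions that have been open longer than the threshold.
        cur.execute("""
            SELECT usename, now() - xact_start
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
              AND now() - xact_start > %s::interval
            """, (MAX_TRANSACTION_AGE,))
        long_transactions = cur.fetchall()
        # Live connections from users we are not allowed to interrupt.
        cur.execute("""
            SELECT usename, count(*)
            FROM pg_stat_activity
            WHERE usename::text = ANY(%s)
            GROUP BY usename
            """, (FRAGILE_USERS,))
        fragile_connections = cur.fetchall()
        con.close()
        for usename, age in long_transactions:
            print("Long running transaction by %s (open %s)" % (usename, age))
        for usename, count in fragile_connections:
            print("%d live connection(s) from fragile user %s" % (count, usename))
        return not (long_transactions or fragile_connections)

    if __name__ == '__main__':
        # Placeholder DSN; refuse to proceed (exit 1) if the checks fail.
        sys.exit(0 if preflight_ok('dbname=launchpad_prod') else 1)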
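
And a sketch of the sort of outage-window filtering I have in mind for the OOPS reports: anything timestamped inside a known outage window gets pulled out of the normal summary and into a separate outage report. The hard-coded window is today's test run; the shape of the OOPS records is an assumption for illustration only.

    # Illustrative only: split OOPSes into normal and outage buckets based on
    # known outage windows. In practice the deployment tooling would record
    # the windows rather than hard coding them.
    from datetime import datetime

    OUTAGE_WINDOWS = [
        (datetime(2011, 9, 8, 8, 53, 5), datetime(2011, 9, 8, 8, 55, 17)),
    ]

    def in_outage_window(timestamp, windows=OUTAGE_WINDOWS):
        return any(start <= timestamp <= end for start, end in windows)

    def split_oopses(oopses):
        """Split (timestamp, oops_id) pairs into (normal, outage) id lists."""
        normal, outage = [], []
        for timestamp, oops_id in oopses:
            (outage if in_outage_window(timestamp) else normal).append(oops_id)
        return normal, outage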