That sounds great. Readonly mode has some substantial drawbacks, so if anything can be gained by getting rid of it, I say go ahead.
 * actually doing lp-related work generally requires write access
 * users can be led up the garden path (and not in a good way) if they
   start doing work without noticing they can't commit it to lp
 * code hosting, the API, and perhaps other service points don't
   understand readonly mode, and fail in a way that is no better than
   being down -- it would be better to give clean errors when it
   actually is down. (for the api, the fault might lie with lplib not
   lp itself)
 * there are Google and archive.org caches of pages if you really need
   to see them
 * heavy users are likely to have their key data offline in bug/review
   mail already
 * there are a bunch of bugs about readonly mode, and more turn up
   from time to time; dealing with them is a waste

Martin

On 13 June 2011 18:50, Robert Collins <robe...@robertcollins.net> wrote:
> Of the many performance problems we're working on, the downtime deploy
> process is particularly important: while downtime deploys are slow, we
> cannot do them frequently. While we cannot do them frequently, folk
> have to work off on db-devel, leading to repeated merge conflicts,
> difficulty in delivering completed work, and extreme time lapses
> between completing something and being able to consider it done.
>
> The latency exacerbates our downtime window - folk don't want to run
> background migrations if they can avoid it, because they have to wait
> to land the follow on code for both the schema change *and* the
> migration to happen. This makes downtime deploys riskier, more likely
> to fail and so compounds the issue.
>
> When we are applying schema changes we use a thing we call 'readonly'
> mode to provide some services while the schema change is being
> applied.
>
> readonly mode is expensive though. Currently the first 15-20 minutes
> of user visible downtime are entirely involved with switching across
> to readonly mode.
> After the deploy all looptuner scripts are blocked
> for 24 hours while we rebuild the readonly slave, and the restoration
> of services is time consuming while we bring enough capacity online to
> handle our user load. We have to bring up new appservers because we're
> always deploying changes which are incompatible with the python schema
> definitions we have.
>
> Now, there are some things we can do to make getting into readonly
> mode faster: we can use stub's ini file to signal readonly mode as
> soon as an appserver reads the file - say once every 2 seconds or some
> such. If we let existing requests complete we'd be looking at ~22
> seconds to get into readonly mode. This won't address the need to
> bring up new appservers, nor the overhead of rebuilding the readonly
> replica.
>
> Those overheads shouldn't be underestimated: even if we get the entry
> into readonly mode down to 30 seconds, the readonly replica breakout
> is still serialised, and if we were to do a 60 second downtime once a
> week then a full 1/7th of the time we would be busy rebuilding
> replicas. We can't do a re-spin of a failed schema change in less than
> 24 hours after a previous one. The overheads of switching appserver
> instances are also non-trivial: it takes nearly an hour to do a
> nodowntime deploy, and similarly in readonly mode as we tear down the
> old readonly instances and bring up the new ones. Total downtime we're
> looking at is schema patch time + ~4-5 minutes.
>
> There is another way to tackle the problem though: we can just switch
> off access to the database cluster, apply schema changes, and reenable
> access. Most schema changes are things we can do pretty easily without
> breaking compatibility with the python code[1]. The ones that are
> harder we can probably still do with a little thought.
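
For what it's worth, the ini-file polling idea quoted above could look roughly like this on the appserver side. This is only a sketch under assumptions: the file layout, the [database] section, the "mode" option, the "schema-change" value and the ModeWatcher name are all made up for illustration, and are not anything that exists in lp or in stub's ini file today.

```python
import configparser
import time


class ModeWatcher:
    """Re-read a small ini file at most once per poll interval.

    Hypothetical helper: the section and option names are assumptions.
    """

    def __init__(self, path, poll_seconds=2.0, clock=time.monotonic):
        self.path = path
        self.poll_seconds = poll_seconds
        self.clock = clock
        self._last_read = None
        self._mode = "normal"

    def _read(self):
        parser = configparser.ConfigParser()
        # ConfigParser.read() silently skips a missing file; in that
        # case we fall back to normal operation.
        parser.read(self.path)
        return parser.get("database", "mode", fallback="normal")

    def mode(self):
        """Return the cached mode, refreshing it if the interval has passed."""
        now = self.clock()
        if self._last_read is None or now - self._last_read >= self.poll_seconds:
            self._mode = self._read()
            self._last_read = now
        return self._mode

    def database_available(self):
        # An appserver would check this per request and show the
        # error page instead of touching the db during a schema change.
        return self.mode() != "schema-change"
```

With a 2 second poll interval, every appserver process would notice the flag within about 2 seconds of the file changing, without re-reading the file on every request.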
>
> In this model, a downtime deploy would be as follows:
>  - ~30 minutes before: shut down scripts that can't be interrupted
>  - @T=0 disable access to the database
>  - apply the schema change
>  - reenable access to the db
>
> No appserver bouncing, nothing. We could show an error page on the
> appservers during this time - we can iterate to make that pretty, and
> we could use the aforementioned ini file to tell the appservers that a
> schema change is going on. Total downtime: schema-patch time + ~60
> seconds.
>
> In terms of development process we would land schema changes that are
> compatible with the python code on devel - just the schema change, no
> python code changes at all. Then we'd do nodowntime deploys as normal
> up to and past that revision; when a good time to do the downtime
> arrives (e.g. we might set a fixed time of the day or week) we'd do
> the downtime deploy described above. After that's live, developers would
> then land code that uses the new schema / populates new columns etc.
>
> If we did have something that required new appservers to be deployed, we'd:
>  - do a nodowntime deploy up to the revision with the schema change in it
>  - decide when we would be doing the downtime and prevent all deploys
>    until that time
>  - then do a hybrid deploy:
>    - ~30 minutes before: shut down scripts that can't be interrupted
>    - stage the new code
>    - @T=0 disable access to the database
>    - in parallel (massively parallel - do all appservers and the
>      schema all at once)
>      - apply the schema change
>      - kill appservers with -9 and restart them with the new code base
>    - reenable access to the db
>
> But I think we'd want to aggressively avoid such scenarios as being
> harder and more complex to execute on, as well as having more
> downtime. (Total downtime will be schema time + 4-5 minutes as per
> readonly mode).
> The ability to do the schema change early means we can
> stop bundling the new python code with the schema change without
> making folk have to carry the stuff in an unmerged branch for extended
> periods of time (which readonly mode requires).
>
> Have I missed some way we can mitigate the costs of readonly mode, or
> something we'd have to have present if we ditched it?
>
> If I haven't, I propose that we:
>  - consult with our users and see if they have concerns or ideas we
>    haven't considered
>  - get stakeholder buy in (I'm going to forward this to the
>    stakeholders now to start discussion)
>  - identify the core facilities we need to move to this process
>  - stop bundling schema & python changes immediately.
>
> We could, if we want, start landing schema things live on devel next week.
>
> -Rob
>
> [1]: We don't split the schema change and python definitions today
> because there is little benefit to doing so. But if doing so gets us
> shorter cycle times, then we will get benefits.
>
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : launchpad-dev@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp