On Mon, 2010-10-18 at 22:54 +1300, Robert Collins wrote:
> On Mon, Oct 18, 2010 at 10:29 PM, Tom Haddon <[email protected]> wrote:
> >> And at that point, if we have a security issue we have to deploy asap;
> >> we'd do the following:
> >> - cowboy it out there [and keep it as a cowboy on future deploys]
> >
> > So this means we'd be deploying a security fix without having run the
> > test suite against it in a controlled environment (i.e. buildbot/PQM)?
>
> No, we can verify things in ec2.

*can* being the operative word here. But even so, I don't think ec2 is
the "canonical" copy of the production environment, otherwise we
wouldn't need buildbot at all, right?

> >> - land a regular branch fixing it for good
> >> - remove the cowboy when the regular branch has been incorporated
> >> into the main deployed codebase.
> >>
> >> This would chop 4 hours off the time that things take to deploy,
> >> remove one buildbot queue and generally make the whole code->live
> >> story a bit simpler, at the cost of making the security-fix story more
> >> complex. Personally, I think that that is a net win.
> >
> > From the LOSA perspective, it's also a lot more work. It basically
> > requires manually applying a cowboy, keeping track of where that cowboy
> > is applied, disabling any auto-rollouts to that server until the cowboy
> > lands, and/or checking there are no cowboys applied on any servers
> > before doing any rollouts.
>
> We already have a means for handling cowboys; we only gain
> incrementally here if we can eliminate that entire process - can we?
> I'd say we *can't* today because 'zomg fix it now' stuff does happen.

Erm, we don't already have a means for handling cowboys. We currently
have a hacked workaround that is very painful and potentially
error-prone for LOSAs. I'm trying to avoid that in the future, and I
don't think "we do this currently" is a good justification.

> I'd like to quantify how much more work it is. Say that there is one
> security landing a month, and we're deploying individual revisions.
> The extra work for handling security via a cowboy is then amortised
> over 200 commits (to take the last month). If we save 5 minutes on the
> inner loop for those 200 commits, and spend 2 hours dealing with that
> security fix, we're still ahead 80 minutes.
>
> Thats not to say that 2 hours would be tolerable or a goal for
> security fixes, just that *overall* its a win to take it out of the
> common case completely.

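As a quick check of the amortisation argument quoted above, here is a
minimal sketch that uses only the figures from the thread (200 commits a
month, 5 minutes saved per commit, one security cowboy costing 2 hours).
The function name and its parameters are made up for illustration. With
those inputs the net saving comes to 880 minutes rather than 80, and the
sum only counts time, not the risk of a cowboy being overwritten.

```python
def net_minutes_saved(commits, minutes_saved_per_commit,
                      cowboy_hours, cowboys=1):
    """Net time saved per month, in minutes (hypothetical helper).

    commits: number of commits deployed in the month
    minutes_saved_per_commit: time saved on each ordinary deploy
    cowboy_hours: extra hours spent handling one security cowboy
    cowboys: number of security cowboys in the month
    """
    saved = commits * minutes_saved_per_commit   # time saved on ordinary deploys
    spent = cowboys * cowboy_hours * 60          # extra cowboy time, in minutes
    return saved - spent


# Figures taken from the quoted paragraph.
print(net_minutes_saved(commits=200, minutes_saved_per_commit=5,
                        cowboy_hours=2))
```
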
You're not really comparing like for like here. You're comparing 5
minutes (or whatever it is) of extra time to deploy something to 2 hours
(or whatever it is) of extra LOSA time for a cowboy (plus the danger of
overwriting the cowboy through human error).

> > I'd propose a slight change to the above suggestion:
> >
> > - Keep production-devel/production-stable (now the buildbot instances
> > run in the DC, there's no extra cost to doing so).
>
> There is a cost: we have to deal with test runs that fail; we have to
> resource test runs on it. If we parallelise the test suite we're going
> to be wanting serious grunt to run the test suite, and that CPU time
> doesn't come free. We also need engineering and sysadmin time to
> manage the instance and have to handle dealing with it during upgrades
> (e.g. the lucid one we just did).
>
> That said, I'm open to keeping it but not deploying from it except
> when there is a security issue.

This kind of defeats the point. If we're not deploying from the same
branch all the time, there are extra manual (and error-prone) steps
involved.

> > - Have an automated job that pulls frequently (or pushes immediately)
> > from the "approved" stable revno to production-stable
>
> we can do that, but I'd do it ondemand: If prepping a security fix,
> request this, and build on that.
>
> > - Security fixes still go through production-devel -> production stable
> > and can then subsequently be landed on devel->stable after having been
> > rolled out.
>
> Sure.
>
> > The advantage of this is that LOSAs can *always* deploy from the tip of
> > production-stable.
>
> I don't see that being implied by the changes you suggest.

Erm, I must not be explaining it properly then, because that's *exactly*
the outcome of what I'm proposing. Can you let me know how that's not
clear so I can try and explain it a little more?

> > No approval is needed, and once we get to the stage
> > of automating deployments that becomes a *lot* easier.
>
> We're aiming for automated-doing-the-deployment not
> automated-triggering-of-deployments.

Ok, so now I definitely am misunderstanding things. Are you saying we
don't want to automatically roll things out (as a longer-term goal)? If
not, I don't think the extra 5 minutes of using the production-stable
branch (which means we consistently deploy from the same branch) will
make any difference.

> The former adds reliability and speed to doing them, the latter adds
> risk in the event that people are busy.
>
> We already, per the new process, have trivial-approvement deployments
> (though our toolchain needs to catch up, and we can't actually
> *action* it till we have qastaging live with edge deployments turned
> off).
>
> So, for clarity, how does the following strike you as an interim
> position (with a review after 6 months):
> - keep prod-stable/devel
> - on request deployments from stable except when doing a security fix
> - cron job to push from stable to prod-devel/prod-stable

To be clear, I'm proposing a cron job to push from stable to
production-devel and production-stable so the test suite doesn't have to
run for production-devel -> production-stable, unless we're doing a
security fix. Not sure if that was clear.

Tom

> -Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

