This is another slightly-meta topic, I'm afraid. tl;dr / tl;wr:

 * QA other people's code.
 * Roll back immediately if something is unsafe to deploy. Do not wait to fix it: the time it takes ec2 to validate your fix is too long.
 * Risky branches are considered harmful. Make them non-risky. (Special note: lockstep changes, where 'all of subject must be verbed', are considered harmful. Doing them is almost certain to cost more than doing things incrementally.)
Our deployment process for a single revision is a pipeline:

 5 minutes    PQM        code lands in trunk
 5 hours      BUILDBOT   tested by buildbot -> fails, or
 5 minutes    PQM        code lands in stable
 20 minutes   DEPLOY     deployment -> fails, or
 15 minutes   QATAGGER   ready for qa
 ???          HUMAN      qa -> fails and rollback [by inserting a rollback at the top], or ready for deployment

Every time a step in the pipeline fails in such a way that we have to start over (e.g. landing a rollback, restarting buildbot), we have to pay the entire cost of the pipeline through to the point where humans can make a qa assessment again.

This pipeline is truncated - I don't include ec2 (because it doesn't interact with other attempted landings), and I don't include deployment (because once a rev *can* be deployed [that is, all its predecessors are good too] it is through the pipeline and unaffected by subsequent landings).

So every time we land a change we have an expected overhead of 5.75 hours if nothing goes wrong. This is increased by things that might go wrong - for instance, landing a branch that has a 50% chance of failing in this pipeline increases the expected overhead: a 50% chance of 5.75 hours and a 50% chance of 11.5 hours, for an expected overhead of about 8.6 hours.

The key characteristic of this pipeline is that no item can complete its path through the pipeline until the item before it has completed.

We land about 200 revisions a month - this has been pretty stable over the last year - rev 11268 was one year back, and we're on 13558 now, or 190 a month. There are 20 work days in a month, so about 10 landings a day, or 0.416 per hour around the clock. So *optimally* our system is going to have just over 2 revisions entering the pipeline in the time it takes one revision to traverse it.

Now, consider the impact of a failure at the front of the pipeline: not only will we have to start over with a fix for the failure, another 2 revisions will enter the pipeline while we do that. If *either* of those fails, we have to fix them and start over before we can use the fix for the very first one that started this.

As an example, say we have nothing in the pipeline at all, and we start with rev A, which is broken.

Time 0
  Rev A lands; buildbot starts on rev A.
  Rev B lands.
  Rev C lands.
Time 5.75
  Rev A is on qastaging.
  buildbot starts on rev C.
  Rev A marked bad.
  Rev D, a rollback for Rev A, lands.
Time 11.5
  Rev C is on qastaging.
  buildbot starts on Rev D.
  Rev B marked bad.

and so it goes, until we roll back *and* the new incoming revisions have no failures of their own. Sadly, I suspect this pattern will seem all too familiar to anyone who has been doing deployment requests and looking at our deployment reports.

So, with - on average - 2 new revisions entering every pass through the cycle, if our expected failure rate were 50% we would have only a 25% chance (0.5 x 0.5) of stabilising and being ready for a deploy, and worse odds still above 50%.

These are the independent variables we are dealing with:
 - expected failure rate
 - length of pipeline

There is a dependent variable:
 - # of unknown revisions between a known-bad revision and its fix (whether that is a rollback or a fix, AKA roll-forward)

Changing the minimum length of the pipeline in a meaningful way requires a massive improvement in test suite timings, which people care about but which isn't resourced *yet*. Note, however, that the length of the pipeline extends indefinitely when we have delays in QA. So the only things we can control are the expected failure rate for landings, and the amount of delay before a revision which might be bad is QA'd [and rolled back if needed].
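To make that arithmetic concrete, here's a rough back-of-the-envelope sketch in Python. It isn't part of our tooling - the constants, function names, and the at-most-one-failure assumption are just mine - it simply replays the numbers from this mail:

# Back-of-the-envelope model of the landing pipeline described above.
# Figures are taken from this mail, not from anything our tooling exposes.

# Minimum pipeline length: PQM + buildbot + PQM + deploy + qatagger, in hours.
PIPELINE_HOURS = 5 / 60 + 5 + 5 / 60 + 20 / 60 + 15 / 60   # = 5.75

# ~10 landings per work day, spread around the clock.
LANDINGS_PER_HOUR = 10 / 24

def expected_overhead(p_fail, hours=PIPELINE_HOURS):
    """Expected hours before a landing clears QA, assuming at most one
    failure (i.e. the rollback/fix itself is good)."""
    return (1 - p_fail) * hours + p_fail * 2 * hours

def revisions_in_flight(rate=LANDINGS_PER_HOUR, hours=PIPELINE_HOURS):
    """How many new revisions land while one revision traverses the pipeline."""
    return rate * hours

def chance_of_clean_window(p_fail, new_revs=None):
    """Chance that every revision landing during one traversal is good,
    i.e. that we end the cycle in a deployable state."""
    if new_revs is None:
        new_revs = round(revisions_in_flight())
    return (1 - p_fail) ** new_revs

if __name__ == "__main__":
    print("minimum pipeline length: %.2f hours" % PIPELINE_HOURS)
    print("revisions entering per traversal: %.1f" % revisions_in_flight())
    print("expected overhead at 50%% failure rate: %.1f hours"
          % expected_overhead(0.5))
    print("chance of a deployable state at 50%% failure rate: %.0f%%"
          % (chance_of_clean_window(0.5) * 100))

Running it gives just over 2 revisions entering per traversal, an expected overhead of about 8.6 hours at a 50% failure rate, and a 25% chance of ending a cycle deployable - the same figures used above.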
Because the minimum pipeline length is nearly 6 hours, we *should expect* that we cannot qa our own code except when we land it first thing in our mornings... Depending on self-qa would make our pipeline 16 hours long (end of one day to the start of the next) at best.

Recently we've had a particularly hard time getting to a deployable state, and I think it has been due to a higher-than-regular failure rate for branches at and post-epic. We need to be quite sensitive to increased risk in branch landings, or we get into this unstable state quite easily. The higher the risk of failure, the greater the risk of a 5.75 hour stall.

Note that this isn't a 'work harder' problem: we can never be totally sure about a branch; that is why we do QA. Instead, this is a 'when deciding how to change something, avoid choices that incur unnecessary risk' problem - what's necessary is an engineer's choice.

Some examples that come to mind:
 - incompatible (internal or web) API changes: if a change breaks stuff in your local branch, it may break stuff in other people's branches, or *untested* stuff in your branch.
   - make the change compatibly: e.g. add a new attribute rather than redefining the existing one.
 - disk layout changes (e.g. where js files are compiled to, etc.):
   - check that merging the branch into an existing, pre-used working dir and running 'make run', 'make test' etc. doesn't fall over with bad dependencies, missing files, etc.

The general approach is very similar to what we are now doing with schema changes to get low-latency schema deploys: make the individual change simpler, doing only the work that is safe to do, and then cleaning up later.

-Rob