Google uses both of these patterns ("rate limit your rollouts" and "one, few, many") together in many of its systems; their value has been proven many, many times by letting us catch "unexpected" failures ("it worked fine in testing, and on the first few hosts we updated, and in the first few clusters, but then it blew up...") before they swept through an entire service or the whole fleet.
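To make "one, few, many" concrete, here's a minimal sketch of what a rate-limited, staged rollout driver can look like. This isn't Google's tooling; the deploy_to() and healthy() helpers, the stage sizes, and the soak time are all made-up placeholders for whatever your own deploy and monitoring systems provide:

import time

def deploy_to(hosts, version):
    # Placeholder: call your real config-management / deploy tooling here.
    print(f"deploying {version} to {len(hosts)} host(s)")

def healthy(hosts):
    # Placeholder: check error rates, alerts, canary metrics, etc.
    return True

def rollout(all_hosts, version, soak_seconds=600):
    # "One, few, many": 1 host, then ~5% of the fleet, then everything.
    stages = [1, max(1, len(all_hosts) // 20), len(all_hosts)]
    done = 0
    for target in stages:
        batch = all_hosts[done:target]
        if not batch:
            continue
        deploy_to(batch, version)
        done = target
        # Rate limit: give each stage time to soak and be observed
        # before touching the next, larger one.
        time.sleep(soak_seconds)
        if not healthy(all_hosts[:done]):
            # Stop here; any hosts not yet updated never see the bad version.
            raise RuntimeError(f"rollout of {version} halted after {done} hosts")

The exact stage sizes matter less than the fact that each one is small enough to survive, and that there's a real observation gap (automated checks, a human watching dashboards, or both) before the next stage.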
A corollary of these patterns is that you have to *design* your systems and services so that these patterns are *possible* to implement. If your service assumes that every instance is running identical code, and doesn't include provisions for slow rollouts of new versions (or even of new settings or configs for existing versions), then every rollout has to be all-or-nothing, and you're eventually going to have a very bad day...

Another pattern that Google typically uses is the "big red button": make sure that your rollout mechanism has a well-known, tested, reliable, and easy-to-use way to pause a rollout in progress, and don't be afraid to use it.

Something else that Google SRE places a lot of emphasis on (because it has repeatedly helped us limit or mitigate the consequences of problems like this) is ensuring that you can always roll back to the last working config, and not being afraid to use that rollback capability. We try really hard to avoid "one way" changes for which there's no easy (or even feasible) way back; it takes effort to design systems so that's possible, but it's well worth the effort. (A rough sketch of what that last-known-good bookkeeping can look like is at the bottom of this message.)

If you are interested in these sorts of issues, you should seriously consider coming to the inaugural USENIX "SREcon" in a couple of weeks (Fri 30 May 2014, in Santa Clara, CA; https://www.usenix.org/conference/srecon14), which Google is helping to sponsor. There will be a heavy Google SRE presence there (including me), all prepared to talk about stuff like this. The keynote speaker is Ben Treynor, Google's "VP, 24x7", who created and still heads Google SRE (as well as several other groups, such as our networking, data center, and cloud management teams).

-Brent
(long-time USENIX and charter LOPSA member, and Google SRE, but not speaking on behalf of Google)

On Mon, May 19, 2014 at 10:29 AM, Dana Quinn <dqu...@gmail.com> wrote:

> For those of you sharing these tales of woe (I have some similar ones I could share) - can you share what you discussed in your post mortems as to how to protect against these issues in the future? One thing I'm curious about is whether you discussed doing slower rollouts to the "production" environment. Did you come up with any general approaches or rules for these types of rollouts? For example, only rolling out to 5% of servers at a time. Another idea is a pattern I've heard discussed for production rollouts, "One, Few, Many", which is as it sounds: roll out to one host in production, observe, then a few servers, then roll out more broadly. Just curious what learnings and what approaches you've taken from these incidents (an incident is a terrible thing to waste!).
>
> Dana
>
> On Mon, May 19, 2014 at 8:20 AM, Ski Kacoroski <kacoro...@gmail.com> wrote:
>
>> Paul & David,
>>
>> So very true. I learned that the hard way when we had a bug in a configuration one-liner that renamed /etc to /somethingelse across 20 different kinds of Unix (I was working for a software development house and we shipped on all of them). It took over a day to break into each one and rename it back. The hardest was DEC Tru64, which required pressing a special key combination at just the right time in the boot sequence.
>>
>> ski
>>
>> On Mon, 19 May 2014 06:41:19 -0700, Paul Graydon <p...@paulgraydon.co.uk> wrote:
>>
>> > At a previous job we managed to push out a passwd file to several hundred servers without a root account in it.
>> > (We'd forgotten to make root a protected account that could never expire in the generating script we used with cfengine.) That was fun. All sorts of stuff broke in some very interesting ways. That led to a fun day of running around servers with recovery disks and replacing the passwd and shadow files.
>> >
>> > David Lang <da...@lang.hm> wrote:
>> >
>> > >To err is human, to really foul things up requires a computer
>> > >
>> > >...and when you automate changes to computers....
>> > >
>> > >I've done similar things, not reformatting everything, but I managed to use an automation tool to break all 250 firewalls at $prior_job in a way that disabled the automation at the same time, requiring booting from recovery media and manual changes to each box to recover. To complicate things, the firewalls mostly continued to work, so we had to juggle the fixes to avoid breaking things even worse.
>> > >
>> > >The good news was that the automation was good enough that I was able to give a couple of people instructions on how to recover, and we got everything fixed in a few hours, but it was an interesting afternoon.
>> > >
>> > >David Lang
>> > >
>> > >On Sun, 18 May 2014, Nick Webb wrote:
>> > >
>> > >> On Sun, May 18, 2014 at 9:38 PM, David Lang <da...@lang.hm> wrote:
>> > >>
>> > >>> wayback to the rescue
>> > >>>
>> > >>> http://web.archive.org/web/20140516225155/http://it.emory.edu/windows7-incident/
>> > >>
>> > >> I hang my head in shame for not checking there!
>> > >>
>> > >> Wow, this is/was a nightmare. For those of us working on automation initiatives, this is one downside to be careful of... when it's so easy to make a mass change, we must take extra care...
>>
>> --
>> "When we try to pick out anything by itself, we find it connected to the entire universe" John Muir
>>
>> Chris "Ski" Kacoroski, kacoro...@gmail.com, 206-501-9803
>> or ski98033 on most IM services
>
> --
> Dana Quinn
> da...@pobox.com
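To make the "roll back to the last working config" idea concrete, here's the rough sketch mentioned above. It's not any real system's mechanism; the paths, the apply_config() call, and the mark_good() step are hypothetical, just to show the shape of keeping a last-known-good config around so rollback is always a single, well-tested step:

import json
import os
import shutil

STATE_DIR = "/var/lib/myservice"                      # hypothetical paths
CURRENT = os.path.join(STATE_DIR, "current.json")
LAST_GOOD = os.path.join(STATE_DIR, "last_good.json")

def apply_config(path):
    # Placeholder: however your service actually loads a config file.
    print(f"applying config from {path}")

def push_config(new_config):
    # Write and apply a new config, without touching the last-good copy yet.
    os.makedirs(STATE_DIR, exist_ok=True)
    with open(CURRENT, "w") as f:
        json.dump(new_config, f, indent=2)
    apply_config(CURRENT)

def mark_good():
    # Call this only after the new config has soaked and looks healthy;
    # it becomes the version that rollback() will return to.
    shutil.copy(CURRENT, LAST_GOOD)

def rollback():
    # The escape hatch: return to the last config that was known to work.
    if not os.path.exists(LAST_GOOD):
        raise RuntimeError("no known-good config recorded; cannot roll back")
    shutil.copy(LAST_GOOD, CURRENT)
    apply_config(CURRENT)

The point is that rolling back is cheap, boring, and exercised regularly rather than a heroic one-off; the same goes for pausing a rollout in progress.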
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/