Sanity checks on generated files are great, but they only go so far... Essentially, they help protect you against problems and failure modes that you *anticipate*, but they don't do much to help against modes that you don't. Rollout rate limiting is a more generic defense; it gives you a fighting chance to notice and respond to problems that you *didn't* anticipate, before those problems are replicated fleet-wide. I'm not suggesting doing rollout rate limiting instead of sanity checks; I'm saying that they're both good ideas, and that a good deployment system should do both.
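To make the combination concrete, here is a minimal Python sketch of a sanity check plus a rate-limited ("one, few, many") rollout. It is only an illustration under stated assumptions, not anyone's actual deployment system: the function names, the batch sizes (1, then 5, then the rest), and the `push` and `healthy` callbacks are all hypothetical stand-ins for whatever your config management tool and monitoring provide.

```python
from typing import Callable, Iterable, List, Sequence


def passwd_has_accounts(passwd_text: str,
                        required: Iterable[str] = ("root",)) -> bool:
    """Sanity check for a failure you anticipated: every required
    account name must appear in the generated passwd text."""
    names = {
        line.split(":", 1)[0]
        for line in passwd_text.splitlines()
        if line and not line.startswith("#")
    }
    return all(name in names for name in required)


def stages(hosts: Sequence[str],
           sizes: Sequence[int] = (1, 5)) -> List[List[str]]:
    """Split hosts into batches: fixed-size early batches
    (hypothetical default: 1, then 5), then everything left."""
    batches: List[List[str]] = []
    start = 0
    for size in sizes:
        batch = list(hosts[start:start + size])
        if batch:
            batches.append(batch)
        start += size
    rest = list(hosts[start:])
    if rest:
        batches.append(rest)
    return batches


def staged_rollout(hosts: Sequence[str],
                   push: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   sizes: Sequence[int] = (1, 5)) -> List[str]:
    """Push batch by batch and stop after the first batch that
    contains an unhealthy host. Returns the hosts that were pushed to,
    which is the most you would have to repair by hand."""
    done: List[str] = []
    for batch in stages(hosts, sizes):
        for host in batch:
            push(host)
            done.append(host)
        # A real system would wait and observe here before checking.
        if not all(healthy(h) for h in batch):
            break
    return done
```

The sanity check only catches the missing-account case someone thought to write down. The staged loop does not know what went wrong; it only limits how many hosts are affected before a person or a health check notices. In practice you would run the check on the generated file first, and call `staged_rollout` only if it passes.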
-Brent

On Mon, May 19, 2014 at 11:23 AM, Paul Graydon <p...@paulgraydon.co.uk> wrote:

> In our case we discussed limiting rollout rate, but decided it
> ultimately wasn't worth it. For everything else we already did it, passwd
> files were the main exception. With no LDAP system for auth (too much
> overhead, not enough value was the general view given access to the hosts
> was fairly limited anyway), we were dependent on those local files, and it
> was the only way we could cut someone off from access to those systems.
>
> The main post-mortem tasks were mostly around fixing the generation
> scripts so that it was impossible for important accounts to expire (we
> found a few others that were potential problems).
>
> Paul
>
>
> On 05/19/14 10:29, Dana Quinn wrote:
>
> For those of you sharing these tales of woe (I have some similar ones I
> could share) - can you share what you discussed in your post mortems as to
> protect against these issues in the future? One thing I'm curious about is
> whether you discussed doing slower rollouts to the "production"
> environment. Did you come up with any general approaches or rules for
> these type of rollouts? For example only rolling out to 5% of servers at
> a time. Or another idea is a pattern i've heard discussed for production
> rollouts: One, Few, Many - which is as it sounds, rollout to one host in
> production, observe, then a few servers, then rollout more broadly. Just
> curious what learnings and what approaches you've taken from these
> incidents (an incident is a terrible thing to waste!).
>
> Dana
>
>
> On Mon, May 19, 2014 at 8:20 AM, Ski Kacoroski <kacoro...@gmail.com> wrote:
>
>> Paul & David,
>>
>> So very true. I learned that the hard way when we had a bug in a
>> configuration one-liner that renamed /etc to /somethingelse across 20
>> different kinds of unix (I was working for a software development house
>> and we shipped on all of them). Took over a day to break into each one
>> and rename it back. Hardest was Dec Tru64 which required pressing a
>> special key combination at just the right time in the boot sequence.
>>
>> ski
>>
>> On Mon, 19 May 2014 06:41:19 -0700
>> Paul Graydon <p...@paulgraydon.co.uk> wrote:
>>
>> > At a previous job we managed to push out passwd file to several
>> > hundred servers without a root account in it. (we'd forgotten to make
>> > root a protected account that could never expire in the generating
>> > script we used with cfengine) That was fun. All sorts of stuff broke
>> > in some very interesting ways. That lead to a fun day of running
>> > around servers with recovery disks and replacing the passwd and
>> > shadow files.
>> >
>> > David Lang <da...@lang.hm> wrote:
>> >
>> > >to err is human, to really foul things up requires a computer
>> > >
>> > >...and when you automate changes to computers....
>> > >
>> > >I've done similar things, not reformatting everything, but I managed
>> > >to use an automation tool to break all 250 firewalls in at
>> > >$prior_job in a way that disabled the automation at the same time,
>> > >requiring booting from recovery media and manual changes to each box
>> > >to recover. To complicate things, the firewalls mostly continued to
>> > >work, so we had to juggle the fixes to avoid breaking things even
>> > >worse.
>> > >
>> > >The good news was that the automation was good enough that I was
>> > >able to give a couple people instructions on how to recover and we
>> > >got everything fixed in a few hours, but it was an interesting
>> > >afternoon.
>> > >
>> > >David Lang
>> > >
>> > >On Sun, 18 May 2014, Nick Webb wrote:
>> > >
>> > >> On Sun, May 18, 2014 at 9:38 PM, David Lang <da...@lang.hm> wrote:
>> > >>
>> > >>> wayback to the rescue
>> > >>>
>> > >>> http://web.archive.org/web/20140516225155/http://it.
>> > >>> emory.edu/windows7-incident/
>> > >>>
>> > >>>
>> > >> I hang my head in shame for not checking there!
>> > >>
>> > >> Wow this is/was a nightmare. For those of us working on automation
>> > >> initiatives, this is one downside to be careful of... when it's so
>> > >> easy to make a mass change we must take extra care...
>> > >>
>>
>> --
>> "When we try to pick out anything by itself, we find it
>> connected to the entire universe" John Muir
>>
>> Chris "Ski" Kacoroski, kacoro...@gmail.com, 206-501-9803
>> or ski98033 on most IM services
>>
>
> --
> Dana Quinn
> da...@pobox.com
>
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/