For those of you sharing these tales of woe (I have some similar ones I could share): can you share what you discussed in your post-mortems to protect against these issues in the future? One thing I'm curious about is whether you discussed doing slower rollouts to the "production" environment. Did you come up with any general approaches or rules for these types of rollouts? For example, only rolling out to 5% of servers at a time. Another idea is a pattern I've heard discussed for production rollouts: One, Few, Many. It is as it sounds: roll out to one host in production, observe, then to a few servers, then roll out more broadly. Just curious what you learned and what approaches you've adopted after these incidents (an incident is a terrible thing to waste!).
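
To make the two ideas above concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the names `percent_batches`, `plan_waves` and `staged_rollout`, the `deploy` and `healthy` callbacks, and the default "few" size of 5 are hypothetical and not taken from any tool mentioned in this thread. A real rollout would also want a soak period between waves and a rollback path.

```python
from typing import Callable, List, Sequence


def percent_batches(hosts: Sequence[str], percent: int = 5) -> List[List[str]]:
    """Split hosts into batches of roughly `percent` of the fleet each."""
    hosts = list(hosts)
    # Round up, and never use a batch smaller than one host.
    size = max(1, (len(hosts) * percent + 99) // 100)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def plan_waves(hosts: Sequence[str], few: int = 5) -> List[List[str]]:
    """Split hosts into One, Few, Many waves (the 'few' size is an assumption)."""
    hosts = list(hosts)
    waves = [hosts[:1], hosts[1:1 + few], hosts[1 + few:]]
    return [wave for wave in waves if wave]  # drop empty waves on small fleets


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    few: int = 5,
) -> List[str]:
    """Deploy wave by wave and stop after the first wave with an unhealthy host.

    Returns the hosts that were actually deployed to, so the caller knows
    exactly how far the change got.
    """
    deployed: List[str] = []
    for wave in plan_waves(hosts, few):
        for host in wave:
            deploy(host)
            deployed.append(host)
        # Observe the whole wave before touching anything else.
        if not all(healthy(host) for host in wave):
            break
    return deployed
```

The point of returning the deployed list is that when a wave fails, the damage is limited to at most one plus `few` hosts, and you know which ones need fixing.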

Dana

On Mon, May 19, 2014 at 8:20 AM, Ski Kacoroski <kacoro...@gmail.com> wrote:
> Paul & David,
>
> So very true. I learned that the hard way when we had a bug in a
> configuration one-liner that renamed /etc to /somethingelse across 20
> different kinds of unix (I was working for a software development house
> and we shipped on all of them). Took over a day to break into each one
> and rename it back. Hardest was Dec Tru64 which required pressing a
> special key combination at just the right time in the boot sequence.
>
> ski
>
> On Mon, 19 May 2014 06:41:19 -0700
> Paul Graydon <p...@paulgraydon.co.uk> wrote:
>
> > At a previous job we managed to push out a passwd file to several
> > hundred servers without a root account in it. (We'd forgotten to make
> > root a protected account that could never expire in the generating
> > script we used with cfengine.) That was fun. All sorts of stuff broke
> > in some very interesting ways. That led to a fun day of running
> > around servers with recovery disks and replacing the passwd and
> > shadow files.
> >
> > David Lang <da...@lang.hm> wrote:
> >
> > >to err is human, to really foul things up requires a computer
> > >
> > >...and when you automate changes to computers....
> > >
> > >I've done similar things, not reformatting everything, but I managed
> > >to use an automation tool to break all 250 firewalls at
> > >$prior_job in a way that disabled the automation at the same time,
> > >requiring booting from recovery media and manual changes to each box
> > >to recover. To complicate things, the firewalls mostly continued to
> > >work, so we had to juggle the fixes to avoid breaking things even
> > >worse.
> > >
> > >The good news was that the automation was good enough that I was
> > >able to give a couple people instructions on how to recover and we
> > >got everything fixed in a few hours, but it was an interesting
> > >afternoon.
> > >
> > >David Lang
> > >
> > >On Sun, 18 May 2014, Nick Webb wrote:
> > >
> > >> On Sun, May 18, 2014 at 9:38 PM, David Lang <da...@lang.hm> wrote:
> > >>
> > >>> wayback to the rescue
> > >>>
> > >>> http://web.archive.org/web/20140516225155/http://it.emory.edu/windows7-incident/
> > >>>
> > >> I hang my head in shame for not checking there!
> > >>
> > >> Wow this is/was a nightmare. For those of us working on automation
> > >> initiatives, this is one downside to be careful of... when it's so
> > >> easy to make a mass change we must take extra care...
> > >>
>
> --
> "When we try to pick out anything by itself, we find it
> connected to the entire universe" John Muir
>
> Chris "Ski" Kacoroski, kacoro...@gmail.com, 206-501-9803
> or ski98033 on most IM services

--
Dana Quinn
da...@pobox.com
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/