In our case we discussed limiting the rollout rate, but decided it ultimately wasn't worth it. We already did it for everything else; passwd files were the main exception. With no LDAP system for auth (the general view was that it was too much overhead for not enough value, given that access to the hosts was fairly limited anyway), we were dependent on those local files, and updating them was the only way we could cut someone off from access to those systems.

The post-mortem tasks were mostly about fixing the generation scripts so that important accounts could never expire (we found a few others that were potential problems).
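
A minimal sketch of the kind of guard this implies, assuming a Python step that runs just before the generated file is handed to the distribution tool. The original scripts are not shown in this thread, so `PROTECTED_ACCOUNTS`, `missing_protected` and `check_passwd` are hypothetical names, and the list of protected accounts is an assumption:

```python
# Hypothetical guard: refuse to ship a generated passwd file that is
# missing any account we consider protected.
PROTECTED_ACCOUNTS = {"root"}  # assumption: extend with other critical accounts


def missing_protected(passwd_text, protected=PROTECTED_ACCOUNTS):
    """Return the protected account names absent from passwd-format text."""
    present = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:UID:GID:GECOS:home:shell
        present.add(line.split(":", 1)[0])
    return sorted(set(protected) - present)


def check_passwd(passwd_text, protected=PROTECTED_ACCOUNTS):
    """Raise ValueError instead of letting a broken file be distributed."""
    missing = missing_protected(passwd_text, protected)
    if missing:
        raise ValueError(
            "refusing to publish passwd, missing: " + ", ".join(missing)
        )
    return passwd_text
```

The point of raising rather than logging is that the push stops before a file without root reaches any host.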

Paul

On 05/19/14 10:29, Dana Quinn wrote:
For those of you sharing these tales of woe (I have some similar ones I could share) - can you share what you discussed in your post-mortems to protect against these issues in the future? One thing I'm curious about is whether you discussed doing slower rollouts to the "production" environment. Did you come up with any general approaches or rules for these types of rollouts? For example, only rolling out to 5% of servers at a time. Another idea is a pattern I've heard discussed for production rollouts: One, Few, Many. It is as it sounds: roll out to one host in production, observe, then to a few servers, then roll out more broadly. Just curious what learnings and approaches you've taken from these incidents (an incident is a terrible thing to waste!).
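
As a rough illustration of One, Few, Many, here is a sketch in Python. It is not taken from any tool mentioned in this thread: `one_few_many` and `staged_rollout` are hypothetical names, `apply_change` and `healthy` are callables the caller would have to supply, and the size of the "few" stage is an arbitrary assumption.

```python
def one_few_many(hosts, few=5):
    """Split hosts into three ordered stages: one, a few, then the rest."""
    hosts = list(hosts)
    stages = [hosts[:1], hosts[1:1 + few], hosts[1 + few:]]
    return [stage for stage in stages if stage]


def staged_rollout(hosts, apply_change, healthy, few=5):
    """Apply a change stage by stage, stopping after the first bad stage.

    apply_change and healthy are hypothetical callables supplied by the
    caller. Returns the list of hosts that were changed.
    """
    done = []
    for stage in one_few_many(hosts, few):
        for host in stage:
            apply_change(host)
            done.append(host)
        # Observe the whole stage before widening the blast radius.
        if not all(healthy(host) for host in stage):
            break
    return done
```

With 20 hosts and `few=5` the stages are 1, 5 and 14 hosts, so a bad change caught in the second stage touches at most 6 machines instead of all 20.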

Dana


On Mon, May 19, 2014 at 8:20 AM, Ski Kacoroski <kacoro...@gmail.com> wrote:

    Paul & David,

    So very true.  I learned that the hard way when we had a bug in a
    configuration one-liner that renamed /etc to /somethingelse across 20
    different kinds of unix (I was working for a software development
    house and we shipped on all of them).  Took over a day to break into
    each one and rename it back.  Hardest was DEC Tru64, which required
    pressing a special key combination at just the right time in the boot
    sequence.

    ski

    On Mon, 19 May 2014 06:41:19 -0700
    Paul Graydon <p...@paulgraydon.co.uk> wrote:

    > At a previous job we managed to push out a passwd file to several
    > hundred servers without a root account in it. (We'd forgotten to
    > make root a protected account that could never expire in the
    > generating script we used with cfengine.) That was fun. All sorts of
    > stuff broke in some very interesting ways. That led to a fun day of
    > running around servers with recovery disks and replacing the passwd
    > and shadow files.
    >
    > David Lang <da...@lang.hm> wrote:
    >
    > >to err is human, to really foul things up requires a computer
    > >
    > >...and when you automate changes to computers....
    > >
    > >I've done similar things, not reformatting everything, but I
    > >managed to use an automation tool to break all 250 firewalls at
    > >$prior_job in a way that disabled the automation at the same time,
    > >requiring booting from recovery media and manual changes to each
    > >box to recover. To complicate things, the firewalls mostly
    > >continued to work, so we had to juggle the fixes to avoid breaking
    > >things even worse.
    > >
    > >The good news was that the automation was good enough that I was
    > >able to give a couple people instructions on how to recover and we
    > >got everything fixed in a few hours, but it was an interesting
    > >afternoon.
    > >
    > >David Lang
    > >
    > >On Sun, 18 May 2014, Nick Webb wrote:
    > >
    > >> On Sun, May 18, 2014 at 9:38 PM, David Lang <da...@lang.hm> wrote:
    > >>
    > >>> wayback to the rescue
    > >>>
    > >>> http://web.archive.org/web/20140516225155/http://it.emory.edu/windows7-incident/
    > >>>
    > >>>
    > >> I hang my head in shame for not checking there!
    > >>
    > >> Wow this is/was a nightmare. For those of us working on
    > >> automation initiatives, this is one downside to be careful of...
    > >> when it's so easy to make a mass change we must take extra care...
    > >>
    > >_______________________________________________
    > >Discuss mailing list
    > >Discuss@lists.lopsa.org
    > >https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
    > >This list provided by the League of Professional System
    > >Administrators
    > > http://lopsa.org/



    --
    "When we try to pick out anything by itself, we find it
      connected to the entire universe"            John Muir

    Chris "Ski" Kacoroski, kacoro...@gmail.com, 206-501-9803
    or ski98033 on most IM services


--
Dana Quinn
da...@pobox.com
