Re: [lopsa-discuss] Emory University re-imaged everything by accident

Brent Chapman Mon, 19 May 2014 11:35:57 -0700

Sanity checks on generated files are great, but they only go so far.
Essentially, they help protect you against problems and failure modes that
you *anticipate*, but they don't do much to help against modes that you
don't.  Rollout rate limiting is a more generic defense; it gives you a
fighting chance to notice and respond to problems that you *didn't*
anticipate, before those problems are replicated fleet-wide.  I'm not
suggesting doing rollout rate limiting instead of sanity checks; I'm saying
that they're both good ideas, and that a good deployment system should do
both.
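
A minimal sketch of what "doing both" might look like, assuming a
passwd-style generated file and a One, Few, Many style rollout. Every name
here (passwd_problems, rollout_batches, staged_rollout, and the push and
healthy callbacks) is hypothetical, and the 5% / 25% / 100% steps are only
illustrative; this is not taken from any deployment tool mentioned in the
thread.

```python
from typing import Callable, Iterable, List, Sequence


def passwd_problems(passwd_text: str, required: Iterable[str] = ("root",)) -> List[str]:
    """Sanity check: return problems found in generated passwd content."""
    problems = []
    users = set()
    for lineno, line in enumerate(passwd_text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split(":")
        if len(fields) != 7:  # passwd(5) entries have seven colon-separated fields
            problems.append(f"line {lineno}: expected 7 fields, got {len(fields)}")
            continue
        users.add(fields[0])
    for name in required:
        if name not in users:
            problems.append(f"required account missing: {name}")
    return problems


def rollout_batches(hosts: Sequence[str], percents=(5, 25, 100)) -> List[List[str]]:
    """Rate limiting: one canary host first, then cumulative percentage steps."""
    batches = []
    done = 0
    if hosts:
        batches.append([hosts[0]])
        done = 1
    for pct in percents:
        wanted = (len(hosts) * pct + 99) // 100  # integer ceiling of pct percent
        target = min(len(hosts), max(done, wanted))
        if target > done:
            batches.append(list(hosts[done:target]))
            done = target
    return batches


def staged_rollout(
    hosts: Sequence[str],
    push: Callable[[str, str], None],
    healthy: Callable[[str], bool],
    passwd_text: str,
) -> List[str]:
    """Refuse bad input outright, then push batch by batch.

    Stops after the first batch containing an unhealthy host and returns
    the list of hosts that were pushed to.
    """
    problems = passwd_problems(passwd_text)
    if problems:
        raise ValueError("refusing to deploy: " + "; ".join(problems))
    pushed = []
    for batch in rollout_batches(hosts):
        for host in batch:
            push(host, passwd_text)
            pushed.append(host)
        if not all(healthy(h) for h in batch):
            break  # an unanticipated failure stays confined to this batch
    return pushed
```

The sanity check only catches the failure that was anticipated (a missing
protected account or a malformed line); the batch loop is what limits the
damage from anything else, provided the healthy callback actually notices
the breakage before the next batch goes out.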



-Brent


On Mon, May 19, 2014 at 11:23 AM, Paul Graydon <p...@paulgraydon.co.uk> wrote:

> In our case we discussed limiting rollout rate, but decided it
> ultimately wasn't worth it.  We already did it for everything else; passwd
> files were the main exception.  With no LDAP system for auth (the general
> view was too much overhead for not enough value, given that access to the
> hosts was fairly limited anyway), we were dependent on those local files,
> and it was the only way we could cut someone off from access to those
> systems.
>
> The main post-mortem tasks were mostly around fixing the generation
> scripts so that it was impossible for important accounts to expire (we
> found a few others that were potential problems).
>
> Paul
>
>
> On 05/19/14 10:29, Dana Quinn wrote:
>
> For those of you sharing these tales of woe (I have some similar ones I
> could share) - can you share what you discussed in your post mortems to
> protect against these issues in the future?  One thing I'm curious about is
> whether you discussed doing slower rollouts to the "production"
> environment.  Did you come up with any general approaches or rules for
> these types of rollouts?  For example, only rolling out to 5% of servers at
> a time.  Or another idea is a pattern I've heard discussed for production
> rollouts: One, Few, Many - which is as it sounds: roll out to one host in
> production, observe, then a few servers, then roll out more broadly.  Just
> curious what learnings and what approaches you've taken from these
> incidents (an incident is a terrible thing to waste!).
>
>  Dana
>
>
> On Mon, May 19, 2014 at 8:20 AM, Ski Kacoroski <kacoro...@gmail.com> wrote:
>
>> Paul & David,
>>
>> So very true.  I learned that the hard way when we had a bug in a
>> configuration one-liner that renamed /etc to /somethingelse across 20
>> different kinds of unix (I was working for a software development house
>> and we shipped on all of them).  Took over a day to break into each one
>> and rename it back.  Hardest was DEC Tru64, which required pressing a
>> special key combination at just the right time in the boot sequence.
>>
>> ski
>>
>> On Mon, 19 May 2014 06:41:19 -0700
>> Paul Graydon <p...@paulgraydon.co.uk> wrote:
>>
>> > At a previous job we managed to push out a passwd file to several
>> > hundred servers without a root account in it (we'd forgotten to make
>> > root a protected account that could never expire in the generating
>> > script we used with cfengine). That was fun. All sorts of stuff broke
>> > in some very interesting ways. That led to a fun day of running
>> > around servers with recovery disks and replacing the passwd and
>> > shadow files.
>> >
>> > David Lang <da...@lang.hm> wrote:
>> >
>> > >to err is human, to really foul things up requires a computer
>> > >
>> > >...and when you automate changes to computers....
>> > >
>> > >I've done similar things, not reformatting everything, but I managed
>> > >to use an automation tool to break all 250 firewalls at
>> > >$prior_job in a way that disabled the automation at the same time,
>> > >requiring booting from recovery media and manual changes to each box
>> > >to recover. To complicate things, the firewalls mostly continued to
>> > >work, so we had to juggle the fixes to avoid breaking things even
>> > >worse.
>> > >
>> > >The good news was that the automation was good enough that I was
>> > >able to give a couple people instructions on how to recover and we
>> > >got everything fixed in a few hours, but it was an interesting
>> > >afternoon.
>> > >
>> > >David Lang
>> > >
>> > >On Sun, 18 May 2014, Nick Webb wrote:
>> > >
>> > >> On Sun, May 18, 2014 at 9:38 PM, David Lang <da...@lang.hm> wrote:
>> > >>
>> > >>> wayback to the rescue
>> > >>>
>> > >>> http://web.archive.org/web/20140516225155/http://it.emory.edu/windows7-incident/
>> > >>>
>> > >>>
>> > >> I hang my head in shame for not checking there!
>> > >>
>> > >> Wow this is/was a nightmare. For those of us working on automation
>> > >> initiatives, this is one downside to be careful of... when it's so
>> > >> easy to make a mass change we must take extra care...
>> > >>
>> > >_______________________________________________
>> > >Discuss mailing list
>> > >Discuss@lists.lopsa.org
>> > >https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
>> > >This list provided by the League of Professional System
>> > >Administrators
>> > > http://lopsa.org/
>>
>>
>>
>>  --
>> "When we try to pick out anything by itself, we find it
>>   connected to the entire universe"            John Muir
>>
>> Chris "Ski" Kacoroski, kacoro...@gmail.com, 206-501-9803
>> or ski98033 on most IM services
>>
>>
>
>
>
>  --
> Dana Quinn
> da...@pobox.com
>
>
>
>
>

