Tom Limoncelli <[email protected]> writes:

> On Thu, Apr 21, 2011 at 10:43 AM, Tom Perrine <[email protected]> wrote:
> > Are there any areas where LOPSA should be speaking out?
> 
> ** The Terry Childs case in SF.  Technically he was abiding by the
> town's rule against giving the root password to unauthorized people;
> the state couldn't list anyone "authorized" except the mayor, and they
> wouldn't let him give it to the mayor. (the transcript of that session
> was fun to read; I wish I could find the link again)
> http://blogs.computerworld.com/14592/good_news_for_jailed_sf_net_admin_terry_childs

Alternatively, you could think of this as a specific example of why
you need to solve the bus number problem.  If you had a policy that at
least two people must be able to do the job at any time, and you made
finding and training that second person part of the job
responsibilities, the situation likely wouldn't have become a problem
in the first place.  I mean, what would have happened if the man got
hit by a bus?  Then no amount of "get the mayor in here" would have
recovered the password.

Yes, the other lesson is that some people take security policies
/very/ seriously, and you should thus put some effort into defining
them such that they work even when people interpret them literally;
but like I said, if they had solved the bus number problem, this
problem wouldn't have come up at all.

In some ways, you've gotta admire the man.  I mean, would you go to jail
for your employer's security policy?  I'm honestly not sure that I would,
and unlike Childs, I would actually have something to gain out of resisting
an unpopular government action.  

> **  The AWS / EC2 outage: That companies should be careful when they
> decide what "zone diversity" means to them.
> http://justinsb.posterous.com/aws-down-why-the-sky-is-falling

Eh, the important lesson there, I think, is that often "the number of
entities that need to screw up before I'm screwed" is more important
than geographical redundancy.
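To put a number on that (the figures here are purely illustrative,
not from the AWS postmortem): if an outage requires k independent
entities to all screw up at the same time, expected downtime falls
off geometrically with k, which is why I care more about counting
those entities than about how many miles apart the buildings are.
A quick back-of-the-envelope sketch:

    # Rough expected downtime if an outage needs k independent
    # entities to fail simultaneously.  The 99.9% per-entity
    # availability is an assumption for illustration only.
    HOURS_PER_YEAR = 24 * 365

    def downtime_hours(per_entity_availability, k):
        # Probability that all k are down at once (assuming
        # independence, which is the whole point of counting
        # entities), converted to hours per year.
        return (1.0 - per_entity_availability) ** k * HOURS_PER_YEAR

    for k in (1, 2, 3):
        print(k, "->", round(downtime_hours(0.999, k), 6), "hours/year")
    # prints roughly 8.76, 0.0088, and 0.000009 hours/year

The catch, of course, is the independence assumption; the AWS outage
was a case where the "zones" turned out to be less independent than
people assumed, which is exactly the "be careful what zone diversity
means to you" point.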


Admin error is a lot more common than earthquakes large enough to disturb 
a data center, comets, and nuclear bombs combined.  The only situations 
I've personally seen where an outage would have been prevented by using 
data centers 100 miles apart vs. 100' apart were attributable to poor 
fiber redundancy or poor routing redundancy.

Of course, separating your data centers by 100 miles gets you much
better fiber diversity than data centers 100' apart, unless you do a
lot of extra work verifying shit that people tend to lie about, so
there /is/ a strong argument for geographical redundancy.  But that
argument is mostly "bandwidth providers tend to conceal or even lie
about fiber paths and fiber redundancy, so geographical redundancy is
probably the most reliable way to get fiber redundancy and protect
against backhoe accidents," not the more generally espoused "some
giant disaster might wipe out both data centers."  I mean, recent
events in Japan show that such disasters do happen occasionally, but
they are uncommon; it hasn't happened to any data center I've had to
deal with in my career, and I've seen quite a few backhoe accidents
hit my pager.

Geographical redundancy sometimes also means grid power redundancy,
which is also nice, so yeah, geographical redundancy is good, just
not really for the reasons that are usually given to management.
(Of course, sometimes you can get grid redundancy in the 100'-away
situation.  I am in at least one data center that claims to be on the
border of two different grids.  That seems like a reasonable strategy
for a business whose largest variable cost is going to be power.)

But yeah, admin error, in my experience, is a lot more common than
any of those things that can be helped by geographical redundancy.
Unfortunately, preventing admin error also seems to be more difficult 
than setting up geographical redundancy.  

I've actually been thinking a lot about this problem lately; I mean,
even if more than one third party needs to screw it up for me to be
down (and I'm not even that far yet), it's /quite/ difficult to get
into a situation where you aren't completely screwed if you or your
company screws it up.  At nearly all the smaller places I've worked,
you'd only need to compromise one sysadmin's credentials to delete
all data in production /and/ to delete all backups.  (Yes, you can
set up a 'pull' from the backup server, which helps a lot, but
usually that server is remotely manageable, which gets you back to
the problem of compromised sysadmin credentials (or a compromised
sysadmin) == game over.)  Add to that the fact that most backups run
as root (this is one advantage of the old-style 'raw device dump' -
as I recall, the operator group would have read-only access to the
raw disk), so now the guy who manages your backup server essentially
has root on production.  (Yeah, you can mitigate that by restricting
the root shell the backup server's key gets, but that is difficult to
do well, so most people don't do it at all.)
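
(If anyone cares, the "restrict what the backup server's key can do"
part is usually an SSH forced command.  A minimal sketch of the idea;
the script path, the key, and the exact rsync invocation below are
hypothetical, so match them to whatever your pull job actually runs:

    #!/usr/bin/env python3
    # Hypothetical forced-command wrapper for the backup server's key.
    # On the production box, root's authorized_keys would carry
    # something like:
    #   command="/usr/local/sbin/backup-cmd.py",no-pty,no-port-forwarding ssh-ed25519 AAAA... backup@backuphost
    # so that key can only ever run what this script permits, never a
    # full root shell.
    import os
    import sys

    # The one read-only command the backup server is allowed to run.
    # This particular rsync server invocation is illustrative only.
    ALLOWED = "rsync --server --sender -logDtprze.iLsfxC . /srv/"

    def main():
        requested = os.environ.get("SSH_ORIGINAL_COMMAND", "")
        if requested != ALLOWED:
            sys.stderr.write("rejected: %r\n" % requested)
            return 1
        # Hand off to the allowed command; nothing else gets executed.
        os.execvp("rsync", ALLOWED.split())

    if __name__ == "__main__":
        sys.exit(main())

It doesn't help against a compromised backup server, obviously, but
it takes "the backup key is a root shell on production" off the
table.)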

But how do you make it so that you aren't completely stuffed if you
yourself screw something up, or if your credentials are compromised?

I was thinking of the worst case, backups, 'cause I've started working
on the 'lowered expectations' version of my low-performance and very
low-cost storage service.  I was thinking that an outsourced storage
service that let you specify a 'destroy on' date, but that wouldn't
let you overwrite the data before that date, would go a long way
towards making it so that both you and I would need to screw up at
the same time before you lost data.


(assuming, of course, that your production is hosted elsewhere;  I think
hosting production and your only backups with the same third party is a
ridiculously bad idea.)
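
For concreteness, the interface I'm imagining looks something like
this (all names are made up; this isn't an existing service, just a
sketch of the write-once-until-the-date idea):

    import datetime

    class RetainedObjectStore:
        """Sketch of a store that refuses overwrites until 'destroy_on'.
        The enforcement lives on the storage side, so a stolen client
        credential can't take the old backups with it."""

        def __init__(self):
            self._objects = {}  # key -> (data, destroy_on)

        def put(self, key, data, destroy_on):
            existing = self._objects.get(key)
            if existing is not None and existing[1] > datetime.date.today():
                raise PermissionError(
                    "retention in effect until %s" % existing[1])
            self._objects[key] = (data, destroy_on)

        def get(self, key):
            return self._objects[key][0]

    # Usage: today's dump can't be overwritten for 90 days, even with
    # valid credentials; both sides would have to screw up at once
    # before the data is gone.
    store = RetainedObjectStore()
    store.put("db-2011-04-21.dump", b"...",
              destroy_on=datetime.date.today() + datetime.timedelta(days=90))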


