Tom Limoncelli <[email protected]> writes:

> On Thu, Apr 21, 2011 at 10:43 AM, Tom Perrine <[email protected]> wrote:
> > Are there any areas where LOPSA should be speaking out?
>
> ** The Terry Childs case in SF. Technically he was abiding by the
> town's rule against giving the root password to unauthorized people;
> the state couldn't list anyone "authorized" except the mayor, and they
> wouldn't let him give it to the mayor. (The transcript of that session
> was fun to read; I wish I could find the link again.)
> http://blogs.computerworld.com/14592/good_news_for_jailed_sf_net_admin_terry_childs
Alternately, you could think of this as a specific example of why you need to solve the bus number problem. If you had a policy that at least two people should be able to do the job at any time, and you made finding and training the second person part of the job responsibilities, the problem likely wouldn't have become a problem in the first place. I mean, what would have happened if the man got hit by a bus? Then no amount of "get the mayor in here" would have recovered the password.

Yes, the other lesson is that some people take security policies /very/ seriously, and you should thus put some effort into defining them such that they work even when people interpret them literally. But like I said, if they had solved the bus number problem, this problem wouldn't have come up at all.

In some ways, you've gotta admire the man. I mean, would you go to jail for your employer's security policy? I'm honestly not sure that I would, and unlike Childs, I would actually have something to gain from resisting an unpopular government action.

> ** The AWS / EC2 outage: That companies should be careful when they
> decide what "zone diversity" means to them.
> http://justinsb.posterous.com/aws-down-why-the-sky-is-falling

Eh, the important lesson there, I think, is that often "the number of entities that need to screw up before I'm screwed" is more important than geographical redundancy. Admin error is a lot more common than earthquakes large enough to disturb a data center, comets, and nuclear bombs combined. The only situations I've personally seen where an outage would have been prevented by using data centers 100 miles apart vs. 100' apart were attributable to poor fiber redundancy or poor routing redundancy.

Of course, separating your data centers by 100 miles gets you much better fiber diversity than data centers 100' apart, unless you do a lot of extra work verifying shit that people tend to lie about. So there /is/ a strong argument for geographical redundancy, but that argument is mostly "bandwidth providers tend to conceal or even lie about fiber paths and fiber redundancy; geographical redundancy is probably the most reliable way to get fiber redundancy and protect against backhoe accidents", not the more generally espoused "some giant disaster might wipe out both data centers". I mean, recent events in Japan show that such disasters do happen occasionally, but they are uncommon; it hasn't happened to any data center I've had to deal with in my career, and I've seen quite a few backhoe accidents hit my pager.

Geographical redundancy sometimes also means grid power redundancy, which is also nice. So yeah, geographical redundancy is good, just not really for the reasons that are usually given to management. (Of course, sometimes you can get grid redundancy in the 100'-away situation. I am in at least one data center that claims to be on the border of two different grids. It seems like a reasonable strategy for a business whose largest variable cost is going to be power.)

But yeah, admin error, in my experience, is a lot more common than any of those things that can be helped by geographical redundancy. Unfortunately, preventing admin error also seems to be more difficult than setting up geographical redundancy.
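To put some very rough numbers on that point, here's a back-of-envelope sketch in Python. Every probability in it is a made-up illustrative figure, not a measurement; the point is just that one common failure mode (an admin or control plane that can reach both sites) swamps whatever you gained from the geography:

    # Back-of-envelope: how likely is "everything down at once"?
    # All rates below are made-up illustrative numbers, not measurements.
    p_site = 0.01       # one site/zone has an outage in some period
    p_admin = 0.05      # an admin or control-plane mistake that reaches
                        # everything it has access to, in the same period
    p_disaster = 0.0001 # a regional disaster big enough to hit both sites

    # Two sites, but one admin mistake can touch both of them:
    shared = 1 - (1 - p_admin) * (1 - p_site ** 2)
    # Two sites whose only shared risk is the rare regional disaster:
    separate = 1 - (1 - p_disaster) * (1 - p_site ** 2)

    print("shared admin/control plane: %.4f%%" % (shared * 100))
    print("only shared risk is a disaster: %.4f%%" % (separate * 100))

With those made-up figures, the pair that shares an admin is down together roughly 5% of the time versus roughly 0.02% for the pair that only shares the disaster risk; the common mode dominates, which is the point.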
I've actually been thinking a lot about this problem lately. I mean, even if more than one third party needs to screw up for me to be down (and I'm not even that far yet), it's /quite/ difficult to get into a situation where you aren't completely screwed if you or your company screws up.

At nearly all the smaller places I've worked, you'd only need to compromise one sysadmin's credentials to delete all data in production /and/ to delete all backups. (Yes, you can set up a 'pull' from the backup server, which helps a lot, but usually the backup server itself is remotely manageable, which gets you back to the problem of compromised sysadmin credentials (or a compromised sysadmin) == game over.)

Add to that the fact that most backups run as root (this is one advantage of the old-style 'raw device dump' - as I recall, operator would have read-only access to the raw disk), so the guy who manages your backup server essentially has root on production. (Yeah, you can mitigate that by restricting the root shell the backup server's key gets - there's a rough sketch of what I mean below - but it is difficult to do well, so most people don't do it at all.)

But how do you make it so that you aren't completely stuffed if you yourself screw something up, or if your credentials are compromised? I was thinking about the worst case, backups, 'cause I've started working on the 'lowered expectations' version of my low-performance and very-low-cost storage service. I was thinking of an outsourced storage service that would let you specify a 'destroy on' date, but wouldn't let you overwrite the data before that date (also sketched below). It would go a long way towards making it so that both you and I would need to screw up at the same time before you lost data. (Assuming, of course, that your production is hosted elsewhere; I think hosting production and your only backups with the same third party is a ridiculously bad idea.)
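The "restrict what the backup key can do" bit, for anyone who hasn't done it: the usual trick is an SSH forced command on the production host, so the backup server's key can only run a read-only rsync rather than getting a real root shell. A minimal sketch - the rrsync path and the key are placeholders (rrsync ships with rsync, but where your OS installs it varies, and it's sometimes gzipped under the doc directory):

    # /root/.ssh/authorized_keys on the production host: the backup server's
    # key may only run a read-only rsync rooted at /, and nothing else.
    command="/usr/bin/rrsync -ro /",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-rsa AAAA...placeholder... backup@backupserver

That doesn't fix the "a compromised backup box can still read everything" problem, but it does stop the backup key from being a free root shell on production.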
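And a toy sketch of the 'destroy on' semantics, just to make the idea concrete. The names here are made up, the local filesystem stands in for the service, and obviously a real service would enforce this server-side, behind an API that the customer's (possibly compromised) credentials can't bypass:

    import os, time

    class WormStore:
        """Toy write-once store: an object cannot be overwritten or deleted
        before its destroy-on date, no matter whose credentials are used."""

        def __init__(self, root):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def _paths(self, name):
            obj = os.path.join(self.root, name)
            return obj, obj + ".destroy_on"

        def put(self, name, data, destroy_on):
            """Write an object; refuse if an existing copy is still retained."""
            obj, meta = self._paths(name)
            if os.path.exists(meta):
                retained_until = float(open(meta).read())
                if time.time() < retained_until:
                    raise PermissionError("%s is retained until %d" % (name, retained_until))
            with open(obj, "wb") as f:
                f.write(data)
            with open(meta, "w") as f:
                f.write(str(destroy_on))

        def delete(self, name):
            """Delete an object, but only after its destroy-on date has passed."""
            obj, meta = self._paths(name)
            if time.time() < float(open(meta).read()):
                raise PermissionError("too early to delete " + name)
            os.remove(obj)
            os.remove(meta)

So a compromised credential (yours or mine) can still push garbage into new objects, but it can't reach back and destroy last month's backups before their destroy-on date passes; both of us have to screw up before the data is gone.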
