On Thu, 17 Sep 2009, Esther Schindler wrote:

> This could inspire an entire blog post!
>
> (Too bad it doesn't fit in either of my blogs, either my Developer Careers
> blog at Javaworld or my open source blog at ITWorld.com.)

go for it. in many ways this is a 'the Emperor has no clothes' situation.

David Lang

> On Sep 16, 2009, at 6:54 PM, [email protected] wrote:
>
>> On Wed, 16 Sep 2009, [email protected] wrote:
>>
>>> Hi -- been away from my computer. (Yes, that happens!) I wasn't including
>>> scheduled downtime.
>>
>> watch out for this one, because scheduled downtime can be lengthy and
>> frequent and for end-users can be just as bad as unscheduled downtime.
>>
>>
>> ok, a few more questions.
>>
>> if a large service has something happen where 10% of their customers get an
>> error page, is that 'downtime' by your definition? if yes, what if a single
>> customer gets an error page? where do you draw the line?
>>
>> if a site is taken down by a DOS attack, is that 'downtime'?
>>
>> say you have a single server and you unplug the network cable for 10
>> seconds? (tcp retries after 30 seconds so all the packets will get retried
>> and get to their destination, just 30 seconds later than planned)
>>
>> if you have a system checking your site every 5 min and the site is down
>> for 3 min between checks, is it 'downtime'? if it is, how would you know?
>>
>> if you have a business in australia and a ship destroys the undersea cables
>> to other continents, is this 'downtime' for you (you are cut off from your
>> markets in the US, but you are providing very speedy service for your
>> customers on the same continent)
>>
>> on the other hand, if you have a car crash into a telecom box in the corner
>> of the block your building is on and get knocked off the net entirely as a
>> result, most people would count that as 'downtime' for you.
>>
>>
>> the reason I am bringing these things up is that the definition of
>> 'downtime' can be _extremely_ slippery, if you ask 20 different companies
>> you will probably get 20 different definitions.
>> in many cases it really
>> boils down to "if nobody complained it's not downtime" or more formally "if
>> none of the monitoring systems called it an outage it's not downtime",
>> frequently with monitoring systems being set to require something fail two
>> tests in a row with a test interval of a couple of min before calling it an
>> outage.
>>
>>
>>
>> in any case, for a well run system with redundancy engineered in, unplanned
>> downtime, especially downtime that affects a significant portion of the
>> userbase, should be something that happens once every several years. it will
>> eventually happen to everyone, but if you allow for scheduled time to not
>> count against you, you are not deferring maintenance and can perform
>> upgrades, even datacenter moves with a bit of planning.
>>
>> by the way, engineering the redundancy in ends up helping in two ways, you
>> survive unexpected failures, but you can also use that redundancy to allow
>> you to do almost all maintenance without having to have planned downtime,
>> your sysadmins aren't under as much pressure to get things done fast, and
>> so they make fewer mistakes (and don't have to be doing all their work
>> around midnight ;-)
>>
>> the vast majority of services available today do not meet the criteria I
>> just listed, and they are surprisingly successful without doing so
>>
>>
>>
>> managers and CEOs love to talk about how many "9's" of uptime they have.
>> those of us in the trenches know that things are seldom as pretty as they
>> would like everyone to think
>>
>> but even the best planning will occasionally run into problems. take a look
>> at the google outage a few weeks ago, they have a well designed, well
>> tested redundancy plan, but ran into a capacity problem they hadn't
>> anticipated when they took a portion of the servers offline for
>> maintenance.
>>
>>
>> be careful about people who brag too much about their uptime, just like you
>> need to be careful about people who brag too much about their security
>> (remember 'unbreakable' oracle?). you can be good, you can have a solid
>> track record, but you may still be only moments away from a major outage or
>> breach.
>>
>> the name of the game is 'risk management/mitigation/minimization', it isn't
>> 'risk elimination'
>>
>> David Lang
>>
>>>
>>> K I M N A S H
>>> Senior Editor
>>> 914.962.9661
>>> Email: [email protected]
>>> Twitter: http://twitter.com/knash99
>>> Web: http://www.cio.com/author/127852/Kim+S.+Nash
>>>
>>> -----Original Message-----
>>> From: Esther Schindler [mailto:[email protected]]
>>> Sent: Wednesday, September 16, 2009 7:38 PM
>>> To: [email protected]
>>> Cc: LOPSA Discuss List; Kim Nash
>>> Subject: Re: [lopsa-discuss] easy question (I hope) to help a journalist
>>> (not me)
>>>
>>> On Sep 16, 2009, at 1:54 PM, [email protected] wrote:
>>>> do you include scheduled maintenance time as 'downtime'?
>>>
>>> Good question. I don't know the answer. Hopefully Kim will respond,
>>> though she's not on the list it'd be a private message. I dare say
>>> it'd be okay for you to repost her answer.
>>>
>>
>
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
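
Two of the points raised in the thread lend themselves to a small worked example: how much downtime a given number of "9's" actually permits per year, and how a polling monitor that waits for two failed tests in a row can miss a short outage entirely. The Python below is only an illustrative sketch. The function names, the fixed check schedule starting at t=0, and the single contiguous outage window are all assumptions made for the example; nothing here describes a specific monitoring product or anyone's actual setup.

```python
# Hypothetical sketch: names and the simplified model are illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_budget_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by `nines` nines of uptime.

    Three nines means 99.9% uptime, i.e. 1/1000 of the year may be down.
    """
    return MINUTES_PER_YEAR / 10 ** nines


def monitor_reports_outage(outage_start: int, outage_end: int,
                           check_interval: int,
                           failures_required: int = 2) -> bool:
    """Would a fixed-interval poller ever declare this outage?

    Times are in seconds. Checks run at t = 0, check_interval,
    2 * check_interval, ... A check fails when it falls inside
    [outage_start, outage_end). The monitor only declares an outage
    after `failures_required` consecutive failed checks.
    """
    consecutive = 0
    t = 0
    while t < outage_end:
        if t >= outage_start:
            consecutive += 1
            if consecutive >= failures_required:
                return True
        else:
            consecutive = 0  # check passed before the outage began
        t += check_interval
    return False


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, "nines:", downtime_budget_minutes(n), "min/year")

    # Site down for 3 minutes between two 5-minute checks: never seen.
    print(monitor_reports_outage(60, 240, check_interval=300))

    # Same 3-minute outage, 2-minute checks, two failures required:
    # detection depends on where the outage falls relative to the checks.
    print(monitor_reports_outage(100, 280, check_interval=120))
    print(monitor_reports_outage(130, 310, check_interval=120))
```

Under this model, the 3-minute outage that falls between 5-minute checks is never recorded, and even with 2-minute checks the same outage is either caught or missed depending purely on timing. That is the practical sense in which "if none of the monitoring systems called it an outage it's not downtime" shapes reported uptime figures.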
