On Wed, 16 Sep 2009, [email protected] wrote:

> Hi -- been away from my computer. (Yes, that happens!) I wasn't including
> scheduled downtime.

watch out for this one, because scheduled downtime can be lengthy and frequent, and for end-users it can be just as bad as unscheduled downtime.

ok, a few more questions.

if a large service has something happen where 10% of their customers get an error page, is that 'downtime' by your definition? if yes, what if a single customer gets an error page? where do you draw the line?

if a site is taken down by a DoS attack, is that 'downtime'?

say you have a single server and you unplug the network cable for 10 seconds? (tcp retransmits with backoff, so the packets will still get to their destination, just some seconds later than planned)

if you have a system checking your site every 5 min and the site is down for 3 min between checks, is it 'downtime'? if it is, how would you know?

if you have a business in australia and a ship destroys the undersea cables to other continents, is this 'downtime' for you? (you are cut off from your markets in the US, but you are providing very speedy service for your customers on the same continent)

on the other hand, if a car crashes into a telecom box on the corner of the block your building is on and you get knocked off the net entirely as a result, most people would count that as 'downtime' for you.

the reason I am bringing these things up is that the definition of 'downtime' can be _extremely_ slippery. if you ask 20 different companies you will probably get 20 different definitions.

in many cases it really boils down to "if nobody complained it's not downtime" or, more formally, "if none of the monitoring systems called it an outage it's not downtime", frequently with the monitoring systems set to require that something fail two tests in a row, with a test interval of a couple of minutes, before calling it an outage.

in any case, for a well run system with redundancy engineered in, unplanned downtime, especially downtime that affects a significant portion of the userbase, should be something that happens once every several years.

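To make the monitoring point above concrete, here is a minimal sketch of a poller that only declares an outage after two failed checks in a row. The function name `detected_outages`, the 5-minute interval, and the one-hour window are assumptions chosen for illustration; they do not describe any particular monitoring product.

```python
def detected_outages(down_intervals, check_interval=300,
                     failures_required=2, horizon=3600):
    """Return the check times (in seconds) at which an outage is declared.

    down_intervals: list of (start, end) times in seconds during which the
    site is down (down at `start`, back up at `end`).
    All parameters are hypothetical values for this sketch.
    """
    def is_down(t):
        return any(start <= t < end for start, end in down_intervals)

    alerts = []
    consecutive = 0
    # Probe the site at t = 0, check_interval, 2 * check_interval, ...
    for t in range(0, horizon + 1, check_interval):
        if is_down(t):
            consecutive += 1
            # Declare the outage once, at the check that reaches the threshold.
            if consecutive == failures_required:
                alerts.append(t)
        else:
            consecutive = 0
    return alerts


# A 3-minute outage that falls between two checks is never seen.
print(detected_outages([(60, 240)]))
# A 3-minute outage that hits one check fails only once, so no outage is declared.
print(detected_outages([(200, 380)]))
# A 12-minute outage fails two checks in a row and is finally declared.
print(detected_outages([(290, 1010)]))
```

Under these assumptions only the third case is ever recorded as an outage, which is one way "if none of the monitoring systems called it an outage it's not downtime" plays out in practice.
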
it will eventually happen to everyone, but if you allow for scheduled time to not count against you, you are not deferring maintenance and can perform upgrades, even datacenter moves, with a bit of planning.

by the way, engineering the redundancy in ends up helping in two ways: you survive unexpected failures, but you can also use that redundancy to do almost all maintenance without having to have planned downtime. your sysadmins aren't under as much pressure to get things done fast, and so they make fewer mistakes (and don't have to be doing all their work around midnight ;-)

the vast majority of services available today do not meet the criteria I just listed, and they are surprisingly successful without doing so.

managers and CEOs love to talk about how many "9's" of uptime they have. those of us in the trenches know that things are seldom as pretty as they would like everyone to think.

but even the best planning will occasionally run into problems. take a look at the google outage a few weeks ago: they have a well designed, well tested redundancy plan, but ran into a capacity problem they hadn't anticipated when they took a portion of the servers offline for maintenance.

be careful about people who brag too much about their uptime, just like you need to be careful about people who brag too much about their security (remember 'unbreakable' oracle?). you can be good, you can have a solid track record, but you may still be only moments away from a major outage or breach.

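For a rough sense of what the "9's" mentioned above translate to, the arithmetic can be sketched as follows. The helper name `allowed_downtime_seconds` and the 365-day year are assumptions for illustration, and the result says nothing about how any given company actually defines or measures downtime.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(nines):
    """Downtime per year permitted by `nines` nines of availability.

    For example, nines=3 means 99.9% availability, i.e. 1/1000 of the
    year may be spent down.
    """
    return SECONDS_PER_YEAR / 10 ** nines


for n in range(2, 6):
    minutes = allowed_downtime_seconds(n) / 60
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

At five nines the whole yearly budget is a little over five minutes, so whether scheduled maintenance and short, undetected blips are counted makes a large difference to the number that gets quoted.
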
the name of the game is 'risk management/mitigation/minimization', it isn't 'risk elimination'

David Lang

>
> K I M N A S H
> Senior Editor
> 914.962.9661
> Email: [email protected]
> Twitter: http://twitter.com/knash99
> Web: http://www.cio.com/author/127852/Kim+S.+Nash
>
> -----Original Message-----
> From: Esther Schindler [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 7:38 PM
> To: [email protected]
> Cc: LOPSA Discuss List; Kim Nash
> Subject: Re: [lopsa-discuss] easy question (I hope) to help a journalist (not
> me)
>
> On Sep 16, 2009, at 1:54 PM, [email protected] wrote:
>> do you include scheduled maintinance time as 'downtime'?
>
> Good question. I don't know the answer. Hopefully Kim will respond,
> though she's not on the list it'd be a private message. I dare say
> it'd be okay for you to repost her answer.
>
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
