On Wed, 16 Sep 2009, [email protected] wrote:

> Hi -- been away from my computer. (Yes, that happens!) I wasn't including 
> scheduled downtime.

watch out for this one, because scheduled downtime can be lengthy and 
frequent and for end-users can be just as bad as unscheduled downtime.


ok, a few more questions.

if a large service has something happen where 10% of their customers get 
an error page, is that 'downtime' by your definition? if yes, what if a 
single customer gets an error page? where do you draw the line?

if a site is taken down by a DoS attack, is that 'downtime'?

say you have a single server and you unplug the network cable for 10 
seconds, is that 'downtime'? (tcp retransmits lost packets with backoff, so 
they will still get to their destination, just later than planned)

if you have a system checking your site every 5 min and the site is down 
for 3 min between checks, is it 'downtime'? if it is, how would you know?
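to put a rough number on that last question, here is a minimal sketch. it 
assumes the outage starts at a random point relative to the check schedule 
and that each check is an instantaneous probe; the function name is made up 
for illustration. under those assumptions the chance of a check landing 
inside the outage is just outage length divided by check interval.

```python
def detection_probability(outage_min: float, interval_min: float) -> float:
    """Chance that at least one periodic check falls inside an outage.

    Assumption: the outage start is uniformly random relative to the
    check schedule, and a check is an instantaneous probe.
    """
    if interval_min <= 0:
        raise ValueError("check interval must be positive")
    if outage_min < 0:
        raise ValueError("outage length cannot be negative")
    # An outage at least as long as the interval always contains a check.
    return min(1.0, outage_min / interval_min)


# The case from the question: 3 minute outage, checks every 5 minutes.
print(detection_probability(3, 5))
```

so with those numbers the monitoring system would miss the outage roughly 
two times out of five, and nobody would know unless a user complained.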

if you have a business in australia and a ship destroys the undersea 
cables to other continents, is this 'downtime' for you (you are cut off 
from your markets in the US, but you are providing very speedy service for 
your customers on the same continent)

on the other hand, if you have a car crash into a telecom box in the 
corner of the block your building is on and get knocked off the net 
entirely as a result, most people would count that as 'downtime' for you.


the reason I am bringing these things up is that the definition of 
'downtime' can be _extremely_ slippery. if you ask 20 different companies 
you will probably get 20 different definitions. in many cases it really 
boils down to "if nobody complained it's not downtime" or more formally 
"if none of the monitoring systems called it an outage it's not downtime", 
frequently with monitoring systems being set to require that something fail 
two tests in a row, with a test interval of a couple of minutes, before 
calling it an outage.
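a rough sketch of what that "two failures in a row" rule means in practice. 
the function names are hypothetical, and the 2 minute interval below is 
just my reading of "a couple of min". it assumes checks run on a fixed 
schedule and fail for the whole duration of an outage.

```python
def declares_outage(results, required_failures=2):
    """Return True if the checks ever fail `required_failures` times in a row.

    `results` is an iterable of booleans, True meaning the check passed.
    """
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= required_failures:
            return True
    return False


def detection_bounds(interval_min, required_failures=2):
    """Outage lengths (in minutes) that are never / always declared.

    Assumption: checks are instantaneous and evenly spaced.
    Outages shorter than the first value can never cover enough
    consecutive checks; outages at least as long as the second always do.
    """
    never_below = (required_failures - 1) * interval_min
    always_from = required_failures * interval_min
    return never_below, always_from


print(declares_outage([True, False, True, False, True]))  # isolated failures
print(declares_outage([True, False, False, True]))        # two in a row
print(detection_bounds(2))
```

with a 2 minute interval, anything shorter than 2 minutes can never be 
called an outage by this rule, and only outages of 4 minutes or more are 
guaranteed to be called one.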



in any case, for a well run system with redundancy engineered in, 
unplanned downtime, especially downtime that affects a significant portion 
of the userbase, should be something that happens once every several 
years. it will eventually happen to everyone, but if scheduled downtime 
doesn't count against you, you don't have to defer maintenance and can 
perform upgrades, even datacenter moves, with a bit of planning.

by the way, engineering the redundancy in ends up helping in two ways: you 
survive unexpected failures, but you can also use that redundancy to do 
almost all maintenance without having to have planned downtime. your 
sysadmins aren't under as much pressure to get things done fast, and so 
they make fewer mistakes (and don't have to be doing all their work around 
midnight ;-)

the vast majority of services available today do not meet the criteria 
I just listed, and they are surprisingly successful without doing so



managers and CEOs love to talk about how many "9's" of uptime they have. 
those of us in the trenches know that things are seldom as pretty as they 
would like everyone to think
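for reference, this is the arithmetic behind the "9's". it is only a 
sketch: it assumes a 365 day year and that every minute of the year counts, 
and the function name is made up. whether scheduled downtime is included is 
exactly the kind of definition that changes what the number means.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Downtime budget per year for an availability of N nines.

    Two nines means 99%, three nines 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("need at least one nine")
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} min/year")
```

three nines works out to a bit under 9 hours a year, and five nines to a 
little over 5 minutes, which is less time than it takes most monitoring 
setups to call something an outage.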

but even the best planning will occasionally run into problems. take a 
look at the google outage a few weeks ago. they have a well designed, well 
tested redundancy plan, but ran into a capacity problem they hadn't 
anticipated when they took a portion of the servers offline for 
maintenance.


be careful about people who brag too much about their uptime, just like 
you need to be careful about people who brag too much about their security 
(remember 'unbreakable' oracle?). you can be good, you can have a solid 
track record, but you may still be only moments away from a major outage 
or breach.

the name of the game is 'risk management/mitigation/minimization', it 
isn't 'risk elimination'

David Lang

>
> K I M   N A S H
> Senior Editor
> 914.962.9661
> Email: [email protected]
> Twitter: http://twitter.com/knash99
> Web: http://www.cio.com/author/127852/Kim+S.+Nash
>
> -----Original Message-----
> From: Esther Schindler [mailto:[email protected]]
> Sent: Wednesday, September 16, 2009 7:38 PM
> To: [email protected]
> Cc: LOPSA Discuss List; Kim Nash
> Subject: Re: [lopsa-discuss] easy question (I hope) to help a journalist (not 
> me)
>
> On Sep 16, 2009, at 1:54 PM, [email protected] wrote:
>> do you include scheduled maintenance time as 'downtime'?
>
> Good question. I don't know the answer. Hopefully Kim will respond,
> though she's not on the list it'd be a private message. I dare say
> it'd be okay for you to repost her answer.
>
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
