On Thu, 17 Sep 2009, Esther Schindler wrote:

> This could inspire an entire blog post!
>
> (Too bad it doesn't fit in either of my blogs, either my Developer Careers 
> blog at Javaworld or my open source blog at ITWorld.com.)

go for it. in many ways this is a 'the Emperor has no clothes' situation.

David Lang

> On Sep 16, 2009, at 6:54 PM, [email protected] wrote:
>
>> On Wed, 16 Sep 2009, [email protected] wrote:
>> 
>>> Hi -- been away from my computer. (Yes, that happens!) I wasn't including 
>>> scheduled downtime.
>> 
>> watch out for this one, because scheduled downtime can be lengthy and 
>> frequent and for end-users can be just as bad as unscheduled downtime.
>> 
>> 
>> ok, a few more questions.
>> 
>> if a large service has something happen where 10% of their customers get an 
>> error page, is that 'downtime' by your definition? if yes, what if a single 
>> customer gets an error page? where do you draw the line?
>> 
>> if a site is taken down by a DOS attack, is that 'downtime'?
>> 
>> say you have a single server and you unplug the network cable for 10 
>> seconds? (tcp retransmits with backoff, so all the packets will get retried 
>> and get to their destination, just some seconds later than planned)
>> 
>> if you have a system checking your site every 5 min and the site is down 
>> for 3 min between checks, is it 'downtime'? if it is, how would you know?
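
A rough way to put a number on that last question, as a minimal sketch rather than anything from the thread: assume the monitor probes at a fixed interval and the outage starts at a random moment relative to the probes. The function name and the uniform-start assumption are hypothetical, chosen only for illustration.

```python
def chance_probe_sees_outage(outage_min: float, interval_min: float) -> float:
    """Probability that at least one probe lands inside the outage.

    Assumption (not from the thread): probes fire every `interval_min`
    minutes and the outage start is uniformly random relative to them.
    """
    if interval_min <= 0:
        raise ValueError("interval_min must be positive")
    if outage_min <= 0:
        return 0.0
    # An outage at least as long as the interval always contains a probe.
    return min(1.0, outage_min / interval_min)


if __name__ == "__main__":
    # The case from the question: 3 minutes down, checked every 5 minutes.
    print(chance_probe_sees_outage(3, 5))
```

Under those assumptions the 3 minute outage is caught by a single probe a bit more than half the time, and the rest of the time nothing records it at all.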
>> 
>> if you have a business in australia and a ship destroys the undersea cables 
>> to other continents, is this 'downtime' for you (you are cut off from your 
>> markets in the US, but you are providing very speedy service for your 
>> customers on the same continent)
>> 
>> on the other hand, if you have a car crash into a telecom box in the corner 
>> of the block your building is on and get knocked off the net entirely as a 
>> result, most people would count that as 'downtime' for you.
>> 
>> 
>> the reason I am bringing these things up is that the definition of 
>> 'downtime' can be _extremely_ slippery; if you ask 20 different companies 
>> you will probably get 20 different definitions. in many cases it really 
>> boils down to "if nobody complained it's not downtime" or more formally "if 
>> none of the monitoring systems called it an outage it's not downtime", 
>> frequently with monitoring systems being set to require something fail two 
>> tests in a row with a test interval of a couple of min before calling it an 
>> outage.
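
The "fail two tests in a row" rule can be sketched like this. It is only an illustration of the rule as described, not the logic of any particular monitoring product; the function name and the list-of-booleans input are hypothetical.

```python
from typing import Optional, Sequence


def first_alert_index(results: Sequence[bool], required_failures: int = 2) -> Optional[int]:
    """Return the index of the check at which an outage would be declared.

    `results` holds one entry per check, True for a passing check and
    False for a failing one. An outage is declared only after
    `required_failures` consecutive failures; None means no alert fired.
    """
    streak = 0
    for i, ok in enumerate(results):
        # A passing check resets the streak; a failing one extends it.
        streak = 0 if ok else streak + 1
        if streak >= required_failures:
            return i
    return None


if __name__ == "__main__":
    # Two isolated failures never count as an outage under this rule.
    print(first_alert_index([True, False, True, False, True]))
    # Two failures in a row do.
    print(first_alert_index([True, False, False, True]))
```

With a rule like this, any outage that fits between two consecutive checks is never called an outage, which is the "if none of the monitoring systems called it an outage" case.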
>> 
>> 
>> 
>> in any case, for a well run system with redundancy engineered in, unplanned 
>> downtime, especially downtime that affects a significant portion of the 
>> userbase should be something that happens once every several years. it will 
>> eventually happen to everyone, but if you allow for scheduled time to not 
>> count against you, you are not deferring maintenance and can perform 
>> upgrades, even datacenter moves with a bit of planning.
>> 
>> by the way, engineering the redundancy in ends up helping in two ways: you 
>> survive unexpected failures, and you can also use that redundancy to do 
>> almost all maintenance without having to have planned downtime. 
>> your sysadmins aren't under as much pressure to get things done fast, and 
>> so they make fewer mistakes (and don't have to be doing all their work 
>> around midnight ;-)
>> 
>> the vast majority of services available today do not meet the criteria I 
>> just listed, and they are surprisingly successful without doing so
>> 
>> 
>> 
>> managers and CEOs love to talk about how many "9's" of uptime they have. 
>> those of us in the trenches know that things are seldom as pretty as they 
>> would like everyone to think
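
For reference, the arithmetic behind the "9's" is simple enough to sketch. This assumes a 365 day year and counts every minute of the year as time the service is supposed to be up, which is itself one of the definitional choices being argued about; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by N nines of availability.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Three nines works out to roughly 526 minutes a year and five nines to a little over 5, so whether a short outage between checks "counts" can decide which figure gets quoted.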
>> 
>> but even the best planning will occasionally run into problems. take a look 
>> at the google outage a few weeks ago, they have a well designed, well 
>> tested redundancy plan, but ran into a capacity problem they hadn't 
>> anticipated when they took a portion of the servers offline for 
>> maintenance.
>> 
>> 
>> be careful about people who brag too much about their uptime, just like you 
>> need to be careful about people who brag too much about their security 
>> (remember 'unbreakable' oracle?). you can be good, you can have a solid 
>> track record, but you may still be only moments away from a major outage or 
>> breach.
>> 
>> the name of the game is 'risk management/mitigation/minimization', it isn't 
>> 'risk elimination'
>> 
>> David Lang
>> 
>>> 
>>> K I M   N A S H
>>> Senior Editor
>>> 914.962.9661
>>> Email: [email protected]
>>> Twitter: http://twitter.com/knash99
>>> Web: http://www.cio.com/author/127852/Kim+S.+Nash
>>> 
>>> -----Original Message-----
>>> From: Esther Schindler [mailto:[email protected]]
>>> Sent: Wednesday, September 16, 2009 7:38 PM
>>> To: [email protected]
>>> Cc: LOPSA Discuss List; Kim Nash
>>> Subject: Re: [lopsa-discuss] easy question (I hope) to help a journalist 
>>> (not me)
>>> 
>>> On Sep 16, 2009, at 1:54 PM, [email protected] wrote:
>>>> do you include scheduled maintenance time as 'downtime'?
>>> 
>>> Good question. I don't know the answer. Hopefully Kim will respond,
>>> though she's not on the list it'd be a private message. I dare say
>>> it'd be okay for you to repost her answer.
>>> 
>> 
>
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
