Re: [lopsa-discuss] easy question (I hope) to help a journalist (not me)

david Thu, 17 Sep 2009 12:29:32 -0700

On Thu, 17 Sep 2009, [email protected] wrote:

> Jeesh, did I open up a can of worms or what? Here I thought I was asking 
> a straightforward question. Boy, I've gotten an education -- thanks, 
> David. Clearly, I have a lot to think about and will go back to that 
> original interviewee who made the offhand remark to me about how he was 
> proud his ERP app hadn't gone down in 343 days. He has some 'splainin to 
> do.


several years ago there was a flurry of unexpected failures when it was 
discovered that the linux kernel had a bug that caused it to crash after 
497 days of uptime
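
for what it's worth, 497 days is about what you get when a 32-bit counter that ticks 100 times a second wraps around. a rough sketch of the arithmetic (the 32-bit width and the 100 ticks/sec rate are assumptions for illustration, and the function name is made up):

```python
# Sketch only: assumes an unsigned 32-bit tick counter that is
# incremented HZ times per second. Both values are assumptions.
COUNTER_BITS = 32
HZ = 100
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits: int = COUNTER_BITS, hz: int = HZ) -> float:
    """Days of continuous uptime before the counter overflows to zero."""
    return (2 ** bits) / hz / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"counter wraps after about {days_until_wrap():.1f} days")
```

a box that gets rebooted for updates a couple of times a year never gets anywhere near that number, which is why nobody noticed for so long.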

I've had systems go through rapid failover to the backup, then fail back 
to the primary because HA software had a bug that caused it to fail over
after a particular box had been up for 497 days straight.

in both of these cases I should have done system updates prior to that 
point that would have required restarting the individual boxes, but 
because I had not done so I ran into those bugs.

getting your uptime of a complex application across many servers and tiers 
up into the year+ range is an accomplishment. it does mean that you have 
solved a lot of problems to get that far (in my experience, especially if
you have a Microsoft infrastructure), but it's not an unusual 
accomplishment.

David Lang

> --kim
>
> K I M   N A S H
> Senior Editor
> 914.962.9661
> Email: [email protected]
> Twitter: http://twitter.com/knash99
> Web: http://www.cio.com/author/127852/Kim+S.+Nash
>
>
> -----Original Message-----
> From: Esther Schindler [mailto:[email protected]]
> Sent: Thursday, September 17, 2009 11:58 AM
> To: [email protected]
> Cc: Kim Nash; [email protected]
> Subject: Re: [lopsa-discuss] easy question (I hope) to help a journalist (not 
> me)
>
> This could inspire an entire blog post!
>
> (Too bad it doesn't fit in either of my blogs, either my Developer
> Careers blog at Javaworld or my open source blog at ITWorld.com.)
>
> On Sep 16, 2009, at 6:54 PM, [email protected] wrote:
>
>> On Wed, 16 Sep 2009, [email protected] wrote:
>>
>>> Hi -- been away from my computer. (Yes, that happens!) I wasn't
>>> including scheduled downtime.
>>
>> watch out for this one, because scheduled downtime can be lengthy
>> and frequent and for end-users can be just as bad as unscheduled
>> downtime.
>>
>>
>> ok, a few more questions.
>>
>> if a large service has something happen where 10% of their customers
>> get an error page, is that 'downtime' by your definition? if yes,
>> what if a single customer gets an error page? where do you draw the
>> line?
>>
>> if a site is taken down by a DoS attack, is that 'downtime'?
>>
>> say you have a single server and you unplug the network cable for 10
>> seconds? (tcp retransmits unacknowledged packets with a backoff, so
>> they will all get to their destination, just some seconds later than
>> planned)
>>
>> if you have a system checking your site every 5 min and the site is
>> down for 3 min between checks, is it 'downtime'? if it is, how would
>> you know?
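
to put a number on the 5 min / 3 min example: if the outage starts at a random point relative to the checks, the odds that any check lands inside it are roughly outage length divided by check interval. a small sketch, assuming checks fire at exact multiples of the interval and take no time themselves (both function names are hypothetical):

```python
def detection_probability(outage_s: int, interval_s: int) -> float:
    """Chance that at least one check lands inside an outage whose start
    time is uniformly random relative to the check schedule."""
    return min(1.0, outage_s / interval_s)


def detection_probability_brute(outage_s: int, interval_s: int) -> float:
    """Same quantity, found by trying every whole-second start offset."""
    seen = 0
    for start in range(interval_s):
        end = start + outage_s  # outage covers [start, end)
        # checks fire at 0, interval_s, 2 * interval_s, ...
        next_check = -(-start // interval_s) * interval_s  # ceiling
        if next_check < end:
            seen += 1
    return seen / interval_s


if __name__ == "__main__":
    p = detection_probability(outage_s=3 * 60, interval_s=5 * 60)
    print(f"chance a 3 min outage is seen by 5 min checks: {p:.0%}")
```

so under those assumptions a 3 min outage slips between 5 min checks about 4 times out of 10, and nothing in the monitoring history will ever show it.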
>>
>> if you have a business in australia and a ship destroys the undersea
>> cables to other continents, is this 'downtime' for you (you are cut
>> off from your markets in the US, but you are providing very speedy
>> service for your customers on the same continent)
>>
>> on the other hand, if you have a car crash into a telecom box in the
>> corner of the block your building is on and get knocked off the net
>> entirely as a result, most people would count that as 'downtime' for
>> you.
>>
>>
>> the reason I am bringing these things up is that the definition of
>> 'downtime' can be _extremely_ slippery, if you ask 20 different
>> companies you will probably get 20 different definitions. in many
>> cases it really boils down to "if nobody complained it's not
>> downtime" or more formally "if none of the monitoring systems called
>> it an outage it's not downtime", frequently with monitoring systems
>> being set to require something fail two tests in a row with a test
>> interval of a couple of min before calling it an outage.
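
the "fail two tests in a row" rule is simple enough to sketch. this is not any particular monitoring product's logic, just a minimal illustration with made-up names, where True means a check passed and False means it failed:

```python
from typing import Iterable, List


def declared_outages(results: Iterable[bool],
                     failures_required: int = 2) -> List[int]:
    """Return the indexes of the checks at which an outage is declared.

    An outage is declared only when `failures_required` checks in a row
    have failed; any passing check resets the count.
    """
    declared = []
    streak = 0
    for i, ok in enumerate(results):
        if ok:
            streak = 0
        else:
            streak += 1
            if streak == failures_required:
                declared.append(i)
    return declared


if __name__ == "__main__":
    flapping = [True, False, True, False, True]
    sustained = [True, False, False, False, True]
    print(declared_outages(flapping))
    print(declared_outages(sustained))
```

with a rule like this a site that fails every other check never records an outage at all, which is exactly how "none of the monitoring systems called it an outage" can be true while users were seeing errors.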
>>
>>
>>
>> in any case, for a well run system with redundancy engineered in,
>> unplanned downtime, especially downtime that affects a significant
>> portion of the userbase should be something that happens once every
>> several years. it will eventually happen to everyone, but if you
>> allow for scheduled time to not count against you, you are not
>> deferring maintenance and can perform upgrades, even datacenter
>> moves, with a bit of planning.
>>
>> by the way, engineering the redundancy in ends up helping in two
>> ways, you survive unexpected failures, but you can also use that
>> redundancy to allow you to do almost all maintenance without having
>> to have planned downtime, your sysadmins aren't under as much
>> pressure to get things done fast, and so they make fewer mistakes
>> (and don't have to be doing all their work around midnight ;-)
>>
>> the vast majority of services available today do not meet the
>> criteria I just listed, and they are surprisingly successful without
>> doing so
>>
>>
>>
>> managers and CEOs love to talk about how many "9's" of uptime they
>> have. those of us in the trenches know that things are seldom as
>> pretty as they would like everyone to think
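
for anyone who hasn't done the arithmetic, here is what the "9's" translate to in allowed downtime per year. a quick sketch, assuming a 365 day year and that every minute counts equally (which, per the questions above, is exactly the part people argue about); the function name is made up:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365 day year


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by N nines of availability,
    e.g. nines=3 means 99.9% uptime."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

three nines works out to a bit under nine hours a year and five nines to a little over five minutes, so whether scheduled maintenance counts makes a very large difference to which number someone can claim.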
>>
>> but even the best planning will occasionally run into problems. take
>> a look at the google outage a few weeks ago, they have a well
>> designed, well tested redundancy plan, but ran into a capacity
>> problem they hadn't anticipated when they took a portion of the
>> servers offline for maintenance.
>>
>>
>> be careful about people who brag too much about their uptime, just
>> like you need to be careful about people who brag too much about
>> their security (remember 'unbreakable' oracle?). you can be good,
>> you can have a solid track record, but you may still be only moments
>> away from a major outage or breach.
>>
>> the name of the game is 'risk management/mitigation/minimization',
>> it isn't 'risk elimination'
>>
>> David Lang
>>
>>>
>>> K I M   N A S H
>>> Senior Editor
>>> 914.962.9661
>>> Email: [email protected]
>>> Twitter: http://twitter.com/knash99
>>> Web: http://www.cio.com/author/127852/Kim+S.+Nash
>>>
>>> -----Original Message-----
>>> From: Esther Schindler [mailto:[email protected]]
>>> Sent: Wednesday, September 16, 2009 7:38 PM
>>> To: [email protected]
>>> Cc: LOPSA Discuss List; Kim Nash
>>> Subject: Re: [lopsa-discuss] easy question (I hope) to help a
>>> journalist (not me)
>>>
>>> On Sep 16, 2009, at 1:54 PM, [email protected] wrote:
>>>> do you include scheduled maintenance time as 'downtime'?
>>>
>>> Good question. I don't know the answer. Hopefully Kim will respond,
>>> though she's not on the list it'd be a private message. I dare say
>>> it'd be okay for you to repost her answer.
>>>
>>
>
>
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
