The following message is a courtesy copy of an article
that has been posted to bit.listserv.ibm-main,alt.folklore.computers as well.


[email protected] (Chris Craddock) writes:
> "just when you think you've created a fool-proof system, the universe will
> deliver you a superior class of fool"
>
> human error (both configuration goofs and operational errors) is THE
> overwhelming cause of system problems these days. Add in application bugs
> and you've pretty much covered the field. Even the squatty boxes rarely if
> ever fail these days. People on the other hand...

a few years ago ... we had some dealings with one of the large financial
networks. they had attributed their 100% availability for an extended
number of years to:

* IMS hot-standby (triple replicated at geographic distance)
* automated operator

I recently mentioned that my wife had been conned into going to POK to
be in charge of loosely-coupled architecture and had done
peer-coupled shared data architecture ... other past posts
http://www.garlic.com/~lynn/submain.html#shareddata

but didn't remain very long in the position because there was little
uptake except for IMS hot-standby (until sysplex).

with significant improvements in basic hardware ... environmental
conditions (like natural disasters) and human mistakes were starting to
dominate failure modes.

in the early 80s, Jim had done a study of system failure modes ... and
outages from other than hardware failures were already starting to
dominate the statistics. scan of the overview foils:
http://www.garlic.com/~lynn/grayft84.pdf

some recent posts mentioning above:
http://www.garlic.com/~lynn/2009.html#39 repeat after me:  RAID != backup
http://www.garlic.com/~lynn/2009.html#47 repeat after me:  RAID != backup
http://www.garlic.com/~lynn/2009.html#65 The 25 Most Dangerous Programming Errors
http://www.garlic.com/~lynn/2009p.html#0 big iron mainframe vs. x86 servers
http://www.garlic.com/~lynn/2009q.html#26 Check out Computer glitch to cause flight delays across U.S. - MarketWatch
http://www.garlic.com/~lynn/2009q.html#28 Check out Computer glitch to cause flight delays across U.S. - MarketWatch
http://www.garlic.com/~lynn/2010f.html#68 But... that's *impossible*

when we were out marketing our HA/CMP product 
http://www.garlic.com/~lynn/subtopic.html#hacmp

... I had coined the terms "geographic survivability" and "disaster
survivability"
http://www.garlic.com/~lynn/submain.html#available

and was (also) asked to write a section for the corporate continuous
availability strategy document. However, the section got pulled because
Rochester & POK complained (at the time, they didn't have any geographic
survivability strategy).

for other topic drift, reference to Jim and me being keynote speakers at
the NASA dependable computing workshop:
http://www.hdcc.cs.cmu.edu/may01/index.html

-- 
virtualization experience starting Jan1968, online at home since Mar1970
