The following message is a courtesy copy of an article that has been posted to bit.listserv.ibm-main,alt.folklore.computers as well.
[email protected] (Chris Craddock) writes:
> "just when you think you've created a fool-proof system, the universe will
> deliver you a superior class of fool"
>
> human error (both configuration goofs and operational errors) is THE
> overwhelming cause of system problems these days. Add in application bugs
> and you've pretty much covered the field. Even the squatty boxes rarely if
> ever fail these days. People on the other hand...

a few years ago ... we had some dealings with one of the large financial
networks. they had attributed their 100% availability for an extended
number of years to:

* IMS hot-standby (triple replicated at geographic distance)
* automated operator

I recently mentioned that my wife had been con'ed into going to POK to
be in charge of loosely-coupled architecture and had done peer-coupled
shared data architecture ... other past posts
http://www.garlic.com/~lynn/submain.html#shareddata

but she didn't remain very long in the position because there was little
uptake except for IMS hot-standby (until sysplex).

with significant improvements in basic hardware ... environmental
conditions (like natural disasters) and human mistakes were starting to
dominate failure modes.

in the early 80s, Jim had done a study of system failure modes ... and
outages from other than hardware failures were already starting to
dominate the statistics. scan of the overview foils:
http://www.garlic.com/~lynn/grayft84.pdf

some recent posts mentioning above:
http://www.garlic.com/~lynn/2009.html#39 repeat after me: RAID != backup
http://www.garlic.com/~lynn/2009.html#47 repeat after me: RAID != backup
http://www.garlic.com/~lynn/2009.html#65 The 25 Most Dangerous Programming Errors
http://www.garlic.com/~lynn/2009p.html#0 big iron mainframe vs. x86 servers
http://www.garlic.com/~lynn/2009q.html#26 Check out Computer glitch to cause flight delays across U.S. - MarketWatch
http://www.garlic.com/~lynn/2009q.html#28 Check out Computer glitch to cause flight delays across U.S. - MarketWatch
http://www.garlic.com/~lynn/2010f.html#68 But... that's *impossible*

when we were out marketing our HA/CMP product
http://www.garlic.com/~lynn/subtopic.html#hacmp

... I had coined the terms "geographic survivability" and "disaster
survivability"
http://www.garlic.com/~lynn/submain.html#available

and was (also) asked to write a section for the corporate continuous
availability strategy document. However, the section got pulled because
Rochester & POK complained (at the time, they didn't have any geographic
survivability strategy).

for other topic drift, reference to Jim and I being keynotes at NASA
dependable computing workshop:
http://www.hdcc.cs.cmu.edu/may01/index.html

-- 
virtualization experience starting Jan1968, online at home since Mar1970

