[email protected] (Charles Mills) writes:
> It is hard to prepare for unknown unknowns. It is legendary that people
> have had recovery failures because the fallover switch (channel, power,
> network, whatever) failed.
re:
http://www.garlic.com/~lynn/2017c.html#13 Check out Massive Amazon cloud service outage disrupts sites

unk-unks are frequently a matter of not having done detailed end-to-end evaluation of all possible scenarios. we started out claiming no-single-point-of-failure ... but it required doing end-to-end walk-throughs looking for all sorts of critical components (I was also asked to review IBM RAID designs and sometimes would uncover single-points-of-failure in the least anticipated places). This included replicated failover switches and the order of precedence among recovery processes (as part of handling some race conditions).

Also needed an inverse "RESERVE" ... there is a failure case where a processor gets suspended just before a write operation, which then kicks off recovery processes. The processor that is assumed to have failed has to be "fenced off" from proceeding with the write operation when it wakes up (aka RESERVE allows only one processor to write and prevents all others; inverse "RESERVE" blocks one or more identified processors from writing; there also has to be a tie-breaker process for race conditions).

"real" no-single-point-of-failure contributed to having to specify geographically separated operation. we also started defining what was needed to handle multiple points of failure ... and looking at 5-nines availability configurations.
http://www.garlic.com/~lynn/submain.html#available

As undergraduate at the univ ... I was first hired as fulltime person responsible for IBM production mainframe systems. Then before graduation, I was hired fulltime by Boeing to help with creation of Boeing Computer Services (consolidating all dataprocessing in an independent business unit to better monetize the investment, including offering services to non-Boeing entities). I thought the Renton datacenter was possibly the largest in the world, with something like $300M (late-60s dollars) in ibm mainframes (360/65s were arriving faster than they could be installed).
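The inverse-"RESERVE" fencing and tie-breaker described above can be sketched in a few lines. This is a toy model with hypothetical names (SharedDisk, fence, tie_break are illustrative, not any actual channel/controller interface): recovery fences the presumed-failed processor before taking over, so a late wake-up write is refused, and a deterministic tie-breaker resolves races between survivors.

```python
# Toy sketch of "inverse RESERVE" fencing; names are illustrative, not
# an actual hardware/controller interface.

class SharedDisk:
    def __init__(self):
        self.fenced = set()   # processor ids blocked from writing
        self.blocks = {}      # block name -> data

    def fence(self, proc_id):
        """Inverse RESERVE: block one identified processor from writing;
        all others may still write (RESERVE is the opposite: one writer,
        everyone else blocked)."""
        self.fenced.add(proc_id)

    def write(self, proc_id, block, data):
        if proc_id in self.fenced:
            raise PermissionError(f"processor {proc_id} is fenced off")
        self.blocks[block] = data

def tie_break(candidates):
    """Race-condition tie-breaker: a deterministic rule (lowest id wins)
    so survivors that concurrently start recovery agree on who proceeds."""
    return min(candidates)

disk = SharedDisk()
disk.fence(1)                    # processor 1 presumed failed; fence first
disk.write(2, "MFD", "new")      # survivor performs the recovery write
try:
    disk.write(1, "MFD", "stale")  # processor 1 wakes up late ... refused
except PermissionError:
    pass
```

The key ordering is that the fence must be in place before recovery writes begin; otherwise the suspended processor could wake up and complete its pending write over the recovered data.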
747#3 was flying the skies of seattle getting FAA flt. certification. There was also a decision to replicate Renton up at the new 747 plant in Everett ... there was a disaster scenario where Mt. Rainier heats up and the resulting mudslide takes out the Renton datacenter. I finally joined IBM (science center) after graduation ... some past posts
http://www.garlic.com/~lynn/subtopic.html#545tech

One of the things that was exposed in the 70s with respect to IBM dasd ... IBM mainframe channel/dasd had (for some time, up through the 80s) an undetectable power-interruption failure mode in the middle of a write operation ... the controller/dasd had sufficient power to complete the write correctly, but there wasn't sufficient power to transfer data from processor memory ... so the record write was completed with all zeros and valid error-correcting codes. In the CMS case, the MFD is somewhat equivalent to the OS VTOC; a change was made to have pairs of alternating MFD records, with a sequence number appended. A power-interrupted MFD write would zero all or part of the appended sequence ... so it wouldn't appear most current during recovery (and the other MFD would be used). Towards the mid-80s there was controller work to try and handle the case (for operating systems that didn't know how). A later hardware solution was that all the data had to be available for the write to start.

Later they let me play disk engineer in bldgs 14&15 ... some past posts
http://www.garlic.com/~lynn/subtopic.html#disk

they had a bunch of mainframes for dasd engineering testing that were scheduled stand-alone 7x24, around the clock. They had once tried to do testing under MVS ... but in that environment MVS had a 15min MTBF, requiring manual re-ipl. I offered to redo the input/output supervisor to be bullet-proof and never fail ... greatly improving productivity, allowing anytime, on-demand concurrent testing. When I wrote it up in an internal report, I may have made a mistake mentioning the MVS 15min MTBF ...
because I was later told that the MVS RAS group did their best to have me separated from the company.

A couple years later ... field engineering had a 3880 controller error-regression test with 57 "injected" errors (that they considered typical and likely to occur). MVS was failing in all 57 cases (requiring manual re-ipl) ... and in 2/3rds of the cases there was no indication of what was responsible for the failure ... previously posted old email
http://www.garlic.com/~lynn/2007.html#email801015

trivia: I had worked with Jim Gray at IBM SJR ... before he left for Tandem. At Tandem he did a detailed analysis of failure modes, finding that hardware was in the process of becoming significantly more reliable ... and failures were starting to shift to human error, software bugs, and environmental causes (power, acts of nature, etc) ... copy of summary from that study
http://www.garlic.com/~lynn/grayft84.pdf

--
virtualization experience starting Jan1968, online at home since Mar1970

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
