[email protected] (John McKown) writes:
> Yes, we have had a TCM fail. I was almost called a liar when I told the
> Windows people that the z simply switched the work transparently (at the
> hardware level) to another CP. They were shocked and amazed that we could
> "hot swap" a new TCM into the box without any outage. The same thing when
> an OSA failed. The other OSA simply did an "ARP rollover" and there were
> no outages. And again, IBM replaced the OSA "hot" and we simply started
> using it. All automatically. But the Windows people still chant "Windows
> is BETTER than the mainframe."
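The OSA "ARP rollover" described above is essentially a gratuitous ARP takeover: the surviving adapter broadcasts an ARP announcement for the shared IP address so that peers on the LAN update their ARP caches to the new MAC, and traffic moves over with no client-visible outage. A minimal stdlib sketch of what such a frame looks like on the wire (the MAC and IP addresses are made up for illustration; this only builds the bytes, it doesn't send them):

```python
import struct

def gratuitous_arp(mac: bytes, ip: bytes) -> bytes:
    """Build a gratuitous ARP request: sender IP and target IP are both
    the takeover address, so every host on the segment that caches that
    IP relearns it against the surviving adapter's MAC."""
    # Ethernet header: broadcast destination, our MAC, ARP ethertype 0x0806
    eth = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)
    # ARP header: HTYPE=1 (Ethernet), PTYPE=0x0800 (IPv4), HLEN=6, PLEN=4, OPER=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # sender MAC/IP, zeroed target MAC, target IP == sender IP (gratuitous)
    arp += mac + ip + b"\x00" * 6 + ip
    return eth + arp

# hypothetical surviving-adapter MAC and shared service IP 10.0.0.1
frame = gratuitous_arp(bytes.fromhex("02000a0b0c0d"), bytes([10, 0, 0, 1]))
```

The key design point is the last field: because sender and target IP are identical, the announcement updates caches without asking anyone a question.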
I was keynote speaker at a NASA dependable computing workshop (along with Jim Gray, who I had worked with at IBM SJR, but he had gone on to Tandem, DEC, and then Microsoft) ... reference gone 404, but lives on at the wayback machine
http://web.archive.org/web/20011004023230/http://www.hdcc.cs.cmu.edu/may01/index.html
and told a mainframe story:

I had done this software support for channel extender ... allowing local controllers & devices to operate at the end of some telco link. For various reasons, I had chosen to simulate "channel check" when various telco errors occurred ... in order to kick off the operating system's recovery/retry routines. Along came the 3090 ... which was designed to have something like 3-5 channel check errors per annum (not per annum per machine ... but per annum across all machines). After 3090s had been out a year ... R-something? was reporting that there had been an aggregate of something like 15-20 channel check errors in the first year across all machines ... which launched a detailed audit of what had gone wrong. They finally found me ... and after a little additional investigation, I decided that for all intents and purposes, simulating an IFCC (interface control check) instead of a CC (channel check) would do just as well from the standpoint of the error retry/recovery procedures activated.

... snip ...

The majority of the audience didn't even understand that errors & faults were being recorded, tracked, collected, trended, etc.

I had done the support in 1980 for STL, which was bursting at the seams and was moving 300 people from the IMS group to an offsite bldg, with dataprocessing back to STL. They had tried remote 3270, but found the human factors totally unacceptable. Channel-extender support allowed local channel-attached controllers at the offsite bldg ... and the human factors were the same offsite as local in STL. Actually, the STL POK mainframes supporting the offsite bldg ran faster ... turns out 3270 controllers had lots of excessive channel busy ...
The channel-extender significantly reduced that 3270 controller channel busy ... moving it all out to the interface at the offsite bldg.

The hardware vendor had tried to get IBM to release my software, but there was a group in POK playing with some serial stuff that got it vetoed (they were afraid that if it was in the market, it would make it harder to get their stuff released). The vendor then had to (exactly) duplicate my support from scratch (including reflecting CC on errors). I then got them to change their implementation from CC to IFCC.

Trivia: in 1988, I was asked to help LLNL standardize some stuff they were playing with ... which quickly becomes the fibre channel standard (including some stuff I had done in 1980). The POK people finally get their stuff released in 1990 with ES/9000 as ESCON, when it was already obsolete. Later, POK people become involved in the fibre channel standard and define a heavy-weight protocol that radically reduces the native throughput, which eventually ships as FICON.

Our last product at IBM was HA/CMP, and after leaving IBM we were brought into the financial institution that had implemented the original magstripe merchant/gift cards ... on a SUN 2-way "HA" platform. Turns out SUN had implemented/copied my HA/CMP design ... even copying my marketing pitches. The system had a failure and "fell over", continuing to work with no outage. SUN replaced the failed component, but the CE forgot to update the configuration with the identifier for the new component ... so it wasn't actually being used. Three months later, when they had a 2nd failure, they found that parts of the DBMS records weren't actually being written/replicated (more than a "single point of failure": three problems, the original failure, the failure to update the configuration info, and the 2nd failure).
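The CC-vs-IFCC decision above is really about which error class the channel extender reflects to the OS when the telco link glitches: either one kicks off the same retry/recovery path, but a simulated CC gets counted against the machine's channel-check RAS statistics (budgeted at 3-5 per annum across ALL 3090s), while an IFCC does not. A hypothetical sketch of that mapping (the class names and structure are invented for illustration, not the actual channel-extender code):

```python
from collections import Counter

CC, IFCC = "channel-check", "interface-control-check"

class RASLog:
    """Stand-in for the error-recording/tracking that flagged the 3090s."""
    def __init__(self):
        self.counts = Counter()

    def record(self, kind):
        self.counts[kind] += 1

def reflect_telco_error(ras: RASLog, simulate: str = IFCC) -> str:
    """Map a telco-link error to a simulated channel error.

    Either error class drives the OS retry/recovery routines, so the
    recovery outcome is the same; only CC inflates the channel-check
    statistics that the hardware audit was built around."""
    ras.record(simulate)
    return "retry"  # same recovery outcome either way

# a year's worth of telco glitches, reported as IFCC instead of CC
ras = RASLog()
for _ in range(20):
    assert reflect_telco_error(ras) == "retry"
```

With `simulate=CC` this same loop would have added 20 channel checks to a fleet-wide annual budget of 3-5, which is roughly what triggered the audit.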
earlier HA/CMP reference/post in this thread
http://www.garlic.com/~lynn/2019c.html#11

--
virtualization experience starting Jan1968, online at home since Mar1970

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
