Re: Hardware failures (was Re: Scary Sysprogs ...)
Back in the z890 days, we had a CPU fail. Of course, the hardware automatically recovered and we only knew about it due to a logrec record being written and a message on the HMC. We also had one of our OSAs fail. The second OSA did an ARP takeover (proper term?) and we suffered _no_ user interruption. The LAN people _refused_ to believe that the OSA could fail that way without disrupting all the IP sessions of the users on that OSA. Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that NIC die immediately (well, they time out).

On Wed, Jan 8, 2014 at 12:52 AM, Elardus Engelbrecht <elardus.engelbre...@sita.co.za> wrote:

> Scott Ford wrote:
>> Like Joel, I haven't seen a hardware failure in the z/OS world since the 70s.
>
> Lucky you. I wrote on IBM-MAIN in May/June 2013 about channel errors which caused SMF damage, amongst other problems. These channel errors were caused by bad optic cables and some directors/routers.
>
> Then last year, during a hardware upgrade, the projects were delayed because an IBM hardware component blew and a part had to be flown in from somewhere... Hardware failures can still happen these days.
>
> Groete / Greetings
> Elardus Engelbrecht

--
Wasn't there something about a PASCAL programmer knowing the value of everything and the Wirth of nothing?

Maranatha!
John McKown

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
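The ARP takeover described above can be illustrated with a toy model (all names here are hypothetical; this is a sketch of the mechanism, not real OSA or TCP/IP stack code). The point is that the service IP address moves to the surviving adapter's MAC via a broadcast (gratuitous) ARP, and since TCP sessions are keyed by IP address, they survive the adapter failure:

```python
# Toy model of adapter (OSA-style) ARP takeover. Hypothetical names;
# a sketch of the mechanism, not actual OSA or z/OS code.

class Host:
    def __init__(self):
        self.arp_cache = {}      # ip -> mac
        self.sessions = set()    # open TCP sessions, keyed by peer ip

    def receive_gratuitous_arp(self, ip, mac):
        # A gratuitous ARP broadcast updates existing cache entries at once.
        self.arp_cache[ip] = mac

def osa_takeover(hosts, service_ip, backup_mac):
    # The backup adapter broadcasts a gratuitous ARP claiming the service IP.
    for h in hosts:
        h.receive_gratuitous_arp(service_ip, backup_mac)

client = Host()
client.arp_cache["10.0.0.1"] = "mac-osa-1"   # primary OSA's MAC
client.sessions.add("10.0.0.1")              # live TCP session to the host

osa_takeover([client], "10.0.0.1", "mac-osa-2")  # primary OSA fails

print(client.arp_cache["10.0.0.1"])   # mac-osa-2
print("10.0.0.1" in client.sessions)  # True: session keyed by IP survives
```

A multi-NIC PC without this kind of takeover loses the sessions because nothing re-announces the failed NIC's IP address under a working MAC.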
Re: Hardware failures (was Re: Scary Sysprogs ...)
Anecdotage is, I suppose, innocuous; but it would be helpful to make some distinctions, in particular one between hardware failures and system failures.

Hardware failures that are recovered from are moderately frequent, as everyone who has had occasion to look at SYS1.LOGREC outputs presumably knows. The merit of z/OS and its predecessors is that most such failures are recovered from without system loss. The system continues to be available and to do useful work.

The hardware is indeed very reliable, but the machinery for detecting and recovering from hardware [and some software] errors makes an equally important contribution to system availability.

John Gilmore, Ashland, MA 01721 - USA
Re: Hardware failures (was Re: Scary Sysprogs ...)
In <caajsdjjarmx2jqj0awzwf1gnptnx_5eysazwro83e2ewsc5...@mail.gmail.com>, on 01/08/2014 at 07:16 AM, John McKown <john.archie.mck...@gmail.com> said:

> The LAN people _refused_ to believe that the OSA could fail that way without disrupting all the IP sessions of the users on that OSA.

That's typical, alas.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT
ISO position; see http://patriot.net/~shmuel/resume/brief.html
We don't care. We don't have to care, we're Congress. (S877: The Shut up and Eat Your spam act of 2003)
Re: Hardware failures (was Re: Scary Sysprogs ...)
On 08/01/2014, at 9:16 PM, John McKown <john.archie.mck...@gmail.com> wrote:

> The LAN people _refused_ to believe that the OSA could fail that way without disrupting all the IP sessions of the users on that OSA. Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that NIC die immediately (well, they time out).

Doesn't NIC teaming solve that problem on distributed systems?
Re: Hardware failures (was Re: Scary Sysprogs ...)
Software recovery of hardware errors? It is quite an opportunity for IBM to re-publicize, especially for these new kids on the block.

I recently had an amazing discussion with someone I respect highly at Oracle (formerly Sun Micro), a very knowledgeable veteran system architect, who was so proud that Solaris had implemented hardware processor storage page-frame recovery in software. You know: take a 'parity' or 'checking-block code' failure (multiple bit errors, etc.), take the frame offline, and if the frame was unchanged and backing a pageable page, invalidate the page so that a fresh copy, from 'the slot' on DASD, gets loaded in. Great stuff. Congratulations, Solaris!

Yes, but this was implemented at IBM so darned early, in the 1970s. I documented it (and other things) in an IBM internal report on 'S/370 Machine Checks in MVS', circa 1974, and then plagiarized myself putting it into the very first 'MVS Diagnostic Techniques' manual (a manual, not a Redbook or sales flyer), circa 1976.

Yes, old news at IBM. If old mainframe veterans are not so aware of this stuff, then can we expect 'the kids' to know about it? Old, old news at IBM, but there is a fresh audience. And they NEED to hear it. We NEED to tell them! Don't we? What do you think?

Dan

-----Original Message-----
From: John Gilmore jwgli...@gmail.com
Sent: Jan 8, 2014 8:31 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Hardware failures (was Re: Scary Sysprogs ...)

> Anecdotage is, I suppose, innocuous; but it would be helpful to make some distinctions, in particular one between hardware failures and system failures. [...]
>
> John Gilmore, Ashland, MA 01721 - USA

Thank you,
Dan

Dan Skwire
home phone 941-378-2383
cell phone 941-400-7632
office phone 941-227-6612
primary email: dskw...@mindspring.com
secondary email: dskw...@gmail.com
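The page-frame recovery logic described above (MVS circa 1974, later Solaris) can be sketched in a few lines. All class and function names here are illustrative, not actual MVS control-block or routine names: on an uncorrectable storage error the frame is taken offline permanently, and if it was unchanged and backed by a paging slot, the page is simply invalidated so the next reference faults in a fresh copy:

```python
# Sketch of software page-frame recovery after an uncorrectable storage
# error. Illustrative names only; not actual MVS or Solaris structures.

class Page:
    def __init__(self):
        self.valid = True    # page-table-entry valid bit

class Frame:
    def __init__(self, page, changed, pageable):
        self.page = page          # the virtual page this frame backs
        self.changed = changed    # modified since page-in? (change bit)
        self.pageable = pageable  # good copy still in a DASD paging slot?

def recover_storage_error(frame, offline_frames):
    """Handle an uncorrectable storage error reported against a frame."""
    offline_frames.add(frame)     # take the bad frame offline for good
    if frame.pageable and not frame.changed:
        # The slot on DASD still holds a valid copy: invalidate the page so
        # the next reference page-faults and reads a fresh copy into a new
        # frame. The error is transparent to the owning work.
        frame.page.valid = False
        return "recovered"
    # The only copy was in the damaged frame; the owning work is affected.
    return "data lost"

offline = set()
clean = Frame(Page(), changed=False, pageable=True)
dirty = Frame(Page(), changed=True, pageable=True)
print(recover_storage_error(clean, offline))   # recovered
print(recover_storage_error(dirty, offline))   # data lost
```

The interesting design point is that recovery costs nothing in the common case: only frames that were modified since page-in (or that back unpageable storage) turn a hardware error into lost data.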
Re: Hardware failures (was Re: Scary Sysprogs ...)
re:
http://www.garlic.com/~lynn/2014.html#23 Scary Sysprogs and educating those 'kids'
http://www.garlic.com/~lynn/2014.html#24 Scary Sysprogs and educating those 'kids'

After transferring to San Jose Research, I was allowed to wander around other locations in the area. One of the places was the disk engineering and development labs. At the time, they had a fair number of IBM mainframes (they would get one of the earliest engineering mainframe processors, usually #3 or #4, for starting disk testing; they needed to test engineering disks, but also to test engineering mainframes with the latest disks). At the time the machine rooms were running all the mainframes 7x24, around the clock, on stand-alone testing schedules.

At one point they had tried to use MVS to provide an operating system environment and be able to do multiple concurrent tests; however, in that environment MVS had a 15-minute MTBF. I offered to rewrite the I/O supervisor to make it bulletproof and never fail, enabling on-demand, anytime, concurrent testing and significantly improving disk development productivity. After that I would get sucked into diagnosing lots of development activity, because frequently, any time there was any kind of problem, they would accuse the software, and I would get a call and have to figure out what the hardware problem was. Old posts about getting to play disk engineer in bldgs 14/15:
http://www.garlic.com/~lynn/subtopic.html#disk

I did a write-up of what was necessary to support the environment and happened to make reference to the MVS 15-minute MTBF, which brought down the wrath of the MVS RAS group on my head; they would have gotten me fired if they could have figured out how (I had tried to work with them to improve MVS RAS, but instead they turned it into an adversarial situation).

A couple of years later, when 3380s started to ship, the MVS system was hanging/failing (requiring re-IPL) in the FE 3380 error regression tests (typical errors expected to be found in customer installations), and in the majority of the cases there was not even an indication of what was responsible for the failure (of course I had to be handling them all along, since nearly all development was being done under systems I provided). Old email from the period discussing MVS failures with the FE 3380 error regression test:
http://www.garlic.com/~lynn/2007.html#email801015

3380s had been announced 11 June 1980:
http://www-03.ibm.com/ibm/history/exhibits/storage/storage_3380.html

--
virtualization experience starting Jan1968, online at home since Mar1970
Re: Hardware failures (was Re: Scary Sysprogs ...)
john.archie.mck...@gmail.com (John McKown) writes:

> We also had one of our OSAs fail. The second OSA did an ARP takeover (proper term?) and we suffered _no_ user interruption. The LAN people _refused_ to believe that the OSA could fail that way without disrupting all the IP sessions of the users on that OSA. Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that NIC die immediately (well, they time out).

re:
http://www.garlic.com/~lynn/2014.html#23 Scary Sysprogs and educating those 'kids'
http://www.garlic.com/~lynn/2014.html#24 Scary Sysprogs and educating those 'kids'
http://www.garlic.com/~lynn/2014.html#27 Hardware failures (was Re: Scary Sysprogs ...)

We did IP-address takeover (the ARP cache entry times out and the IP address is remapped to a different MAC address) in HA/CMP:
http://www.garlic.com/~lynn/subtopic.html#hacmp

However, at the time most vendors used BSD Reno/Tahoe 4.3 software for their TCP/IP stack, and there was a bug in the 4.3 code (and therefore in nearly every machine out there). The bug was in the IP layer: it saved the previous response from the call to the ARP cache, and if the next IP operation was for the same IP address, it used the saved value (and bypassed calling the ARP cache handler). The ARP cache protocol requires that the saved IP-address/MAC-address mapping in the ARP cache time out, after which a new ARP operation has to be done to discover the corresponding MAC address (for that IP address). However, the saved MAC address had no such time-out. In a strongly client/server-oriented environment, where a client does the majority of its TCP/IP traffic to the same server (IP address), it could go for long periods of time without changing IP addresses. As a result, a server doing IP-address takeover to a different LAN/MAC address wouldn't be noticed by such clients.

We had to come up with all sorts of hacks to smear IP-address traffic across the environment, trying to force clients to reset their IP-address-to-MAC-address mapping. There is a separate gimmick which involves MAC-address spoofing: in theory every MAC address is unique, assigned at manufacturing time; however, some number of adapters have been given the ability to soft-reset their MAC address (so if one adapter fails, another adapter can spoof the failed adapter).

--
virtualization experience starting Jan1968, online at home since Mar1970
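The 4.3 bug described above can be sketched as follows (illustrative Python, not the actual BSD C code): the IP output path remembers the last IP-to-MAC answer and reuses it, bypassing the ARP cache and its time-out entirely, so a takeover to a new MAC is never noticed:

```python
# Sketch of the 4.3 BSD saved-MAC bug: the "fast path" pins one
# (ip, mac) pair forever, so the ARP cache time-out never takes effect.

import time

class ArpCache:
    TTL = 60.0                       # seconds before an entry must be re-ARPed
    def __init__(self):
        self.entries = {}            # ip -> (mac, time entered)
    def insert(self, ip, mac):
        self.entries[ip] = (mac, time.monotonic())
    def lookup(self, ip):
        mac, entered = self.entries[ip]
        if time.monotonic() - entered > self.TTL:
            del self.entries[ip]     # expired: caller must ARP again
            raise KeyError(ip)
        return mac

class BuggyIpLayer:
    """Models the 4.3 fast path: one saved (ip, mac) pair, never aged."""
    def __init__(self, cache):
        self.cache = cache
        self.saved = None
    def resolve(self, ip):
        if self.saved and self.saved[0] == ip:
            return self.saved[1]     # BUG: stale MAC reused, cache bypassed
        mac = self.cache.lookup(ip)
        self.saved = (ip, mac)
        return mac

cache = ArpCache()
ip_layer = BuggyIpLayer(cache)
cache.insert("10.0.0.1", "mac-A")
ip_layer.resolve("10.0.0.1")         # fast path now pinned to mac-A

cache.insert("10.0.0.1", "mac-B")    # server takes the IP over to a new MAC
print(ip_layer.resolve("10.0.0.1"))  # mac-A -- the client never notices
```

A client talking mostly to one server stays on the fast path indefinitely, which is exactly the strongly client/server-oriented case described above.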
Re: Hardware failures (was Re: Scary Sysprogs ...)
If you read my original remarks more closely, you would see I did not say I had seen no IBM mainframe hardware failures since the 70's, but that I had not seen any UNDETECTED hardware failures. If the hardware reported no hardware problems, you could pretty well rule that out as a cause of application failure; it had to be a software issue.

As others have already remarked, many detected mainframe hardware failures in recent decades resulted in no outage or minimal disruption because of redundancy in the hardware and z/OS recovery. But I have also seen a few classic cases where the hardware's built-in diagnostics had much difficulty directing proper repairs because the root cause was something that wasn't expected to break (a bad I/O cage in a z9 rather than bad cards; intermittent errors from a loose bolt connecting large power bus strips that didn't show up until 6 months after a water-cooled processor upgrade).

The only hardware issues that occurred with any regularity were issues with tape drives and line printers, and these tended to be either media problems or obvious mechanical issues. Several single-drive failures per year were also common in our RAID-5 DASD subsystems, but these were always nondisruptive to the data and to z/OS.

Joel C. Ewing

On 01/08/2014 07:16 AM, John McKown wrote:

> Back in the z890 days, we had a CPU fail. Of course, the hardware automatically recovered and we only knew about it due to a logrec record being written and a message on the HMC. [...]

--
Joel C. Ewing, Bentonville, AR
jcew...@acm.org
Re: Hardware failures (was Re: Scary Sysprogs ...)
On 2014-01-08 17:23, Joel C. Ewing wrote:

> If you read my original remarks more closely, you would see I did not say I had seen no IBM mainframe hardware failures since the 70's, but that I had not seen any UNDETECTED hardware failures. If the hardware reported no hardware problems, you could pretty well rule that out as a cause of application failure - it had to be a software issue.

OK, I have seen undetected HW failures on a mainframe. Example: IBM RAMAC RVA. A disk module was in '?' status, something between OK and NOT OK. The machine reported no bad disks, but the disk was no longer OK and not in use.

Another example: an ESCON port. OK, I detected it because the attached CU did not work, but the root cause wasn't reported.

Another one, quite recent (it's a microcode issue, actually): an LPAR cannot be IPLed after z/OS shutdown and Reset Clear. Circumvention: don't use Reset Clear, or re-activate the LPAR.

I believe I can dig in my memory deeper...

--
Radoslaw Skorupka
Lodz, Poland
Re: Hardware failures (was Re: Scary Sysprogs ...)
IIRC the 360/50s didn't have a parity-checked CPU bus. Long story short, a CE told me in the early 80's that CE overtime dropped 50% with the intro of the 370s and another 50% when the 303xs were withdrawn.

In a message dated 1/8/2014 10:31:01 A.M. Central Standard Time, r.skoru...@bremultibank.com.pl writes:

> I believe I can dig in my memory deeper...
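For readers who haven't met it, the parity checking mentioned above amounts to one extra bit per byte; a toy sketch (illustrative only, not how any 360/370 data path was actually implemented):

```python
# Per-byte even parity: one extra bit makes the total count of 1-bits even,
# so any single-bit flip is detected (though not located or corrected).

def parity_bit(byte):
    """Return the bit that gives <byte> plus parity an even 1-count."""
    return bin(byte & 0xFF).count("1") % 2

def store(byte):
    return (byte, parity_bit(byte))           # data travels with its parity

def fetch(byte, parity):
    if parity_bit(byte) != parity:
        raise RuntimeError("machine check: parity error")
    return byte

word = store(0b1011_0010)
print(fetch(*word))                           # 178 -- clean fetch passes
flipped = (word[0] ^ 0b0000_0100, word[1])    # single bit flipped in transit
# fetch(*flipped) raises RuntimeError("machine check: parity error")
```

Without such checking on the bus, a flipped bit simply propagates silently, which is consistent with the reported drop in CE overtime once checked machines replaced the older ones.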
Re: Hardware failures (was Re: Scary Sysprogs ...)
efinnel...@aol.com (Ed Finnell) writes:

> IIRC the 360/50s didn't have a parity-checked CPU bus. Long story short, a CE told me in the early 80's that CE overtime dropped 50% with the intro of the 370s and another 50% when the 303xs were withdrawn.

re:
http://www.garlic.com/~lynn/2014.html#23 Scary Sysprogs and educating those 'kids'
http://www.garlic.com/~lynn/2014.html#24 Scary Sysprogs and educating those 'kids'
http://www.garlic.com/~lynn/2014.html#27 Hardware failures (was Re: Scary Sysprogs ...)
http://www.garlic.com/~lynn/2014.html#29 Hardware failures (was Re: Scary Sysprogs ...)

The 303xs were mostly 370s. They took the integrated channel microcode from the 370/158 and created the 303x channel director (a 158 microcode engine with just the integrated channel microcode and without the 370 microcode). A 3031 was a 370/158 engine with the 370 microcode (and without the integrated channel microcode) plus a second (channel director) 370/158 engine with the integrated channel microcode (and without the 370 microcode). A 3032 was a 370/168 reconfigured to work with the channel director. A 3033 started out as 370/168 logic mapped to 20%-faster chips; other optimizations eventually got it up to about 50% faster than a 168.

CEs had a machine diagnostic service process that required being able to scope the machine. The 3081 had chips packaged inside TCMs (thermal conduction modules) and couldn't be scoped. To support the CE service process, the TCMs had a bunch of probes connected to a service processor. CEs then had a (bootstrap) diagnostic service process that could diagnose/scope a failing service processor, and then use a working service processor to diagnose the rest of the machine.

TCM:
http://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV2137.html
and
http://en.wikipedia.org/wiki/Thermal_Conduction_Module#Mainframes_and_supercomputers

Other comments about the 3033 and 3081 being part of the quick-and-dirty effort to get machines back into the product pipeline after the failure of the Future System effort:
http://www.jfsowa.com/computer/memo125.htm

The 3090 started out with a 4331 running a highly modified version of release 6 VM/370-CMS as the service processor (with all the menu screens done in CMS IOS3270). This was upgraded to a pair of 4361s (with probes into the TCMs for diagnosing problems). A reference to the 3092 (service controller) needing a pair of 3370 fixed-block-architecture disks, i.e. the system disks for the VM/4361s (even for a pure MVS installation, where MVS never had any 3370/FBA support):
http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP3090.html

More (although the following says 3090 in 1984, the 3090 wasn't announced until Feb 1985; see above):
http://en.wikipedia.org/wiki/Thermal_Conduction_Module#Mainframes_and_supercomputers

Old email mentioning the 3092:
http://www.garlic.com/~lynn/2010e.html#email861031
http://www.garlic.com/~lynn/2010e.html#email861223

--
virtualization experience starting Jan1968, online at home since Mar1970