Software recovery of hardware errors? It is quite an opportunity for IBM to re-publicize, especially for these new kids on the block. I recently had an amazing discussion with someone at Oracle (formerly Sun Microsystems) whom I respect highly, a very knowledgeable veteran system architect, who was so proud that Solaris had implemented hardware processor-storage page-frame recovery in software.
You know: take a 'parity' or 'checking-block code' failure (multiple-bit errors, etc.), take the frame offline, and, if the frame was unchanged and backing a pageable page, invalidate the page so that a fresh copy, 'the slot' on DASD, gets loaded in. Great stuff. Congratulations, Solaris!

Yes, but this was implemented at IBM so darned early, in the 1970s. I documented it (and other things) in an IBM internal report about 'S370 Machine Checks in MVS', circa 1974, and then plagiarized myself by putting it into the very first 'MVS Diagnostic Techniques' manual (a manual, not a Redbook or sales flyer), circa 1976.

Yes, old news at IBM. If old mainframe veterans are not so aware of this stuff, then can we expect 'the kids' to know about it?

Old, old news at IBM, but there is a fresh audience. And they NEED to hear it. We NEED to tell them! Don't we? What do you think?

Dan

-----Original Message-----
>From: John Gilmore <[email protected]>
>Sent: Jan 8, 2014 8:31 AM
>To: [email protected]
>Subject: Re: Hardware failures (was Re: Scary Sysprogs ...)
>
>Anecdotage is, I suppose, innocuous; but it would be helpful to make
>some distinctions, in particular one between hardware failures and
>system failures.
>
>Hardware failures that are recovered from are moderately frequent, as
>everyone who has had occasion to look at SYS1.LOGREC outputs
>presumably knows.
>
>The merit of z/OS and its predecessors is that most such failures are
>recovered from without system loss. The system continues to be
>available and to do useful work. The hardware is indeed very
>reliable, but the machinery for detecting and recovering from hardware
>[and some software] errors makes an equally important contribution to
>system availability.
>
>John Gilmore, Ashland, MA 01721 - USA
>
>----------------------------------------------------------------------
>For IBM-MAIN subscribe / signoff / archive access instructions,
>send email to [email protected] with the message: INFO IBM-MAIN

Thank you,
Dan

Dan Skwire
home phone 941-378-2383
cell phone 941-400-7632
office phone 941-227-6612
primary email: [email protected]
secondary email: [email protected]
