Software recovery of hardware errors?

This is quite an opportunity for IBM to re-publicize, especially for the
new kids on the block. I recently had an amazing discussion with someone I
respect highly at Oracle (formerly Sun Microsystems), a very knowledgeable
veteran system architect, who was proud that Solaris had implemented
recovery of failing processor-storage page frames in software.

You know: take a 'parity' or 'checking-block code' failure (multiple-bit
errors, etc.), take the frame offline, and, if the frame was unchanged and
backing a pageable page, invalidate the page so that a fresh copy, 'the
slot' on DASD, gets loaded in. Great stuff. Congratulations, Solaris!
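
For anyone who has not seen this done, the steps above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions,
not MVS or Solaris code: the names Frame, Action, and recover_frame are made
up, and the 'changed' and 'pageable' flags stand in for whatever the real
storage manager tracks. What happens when the frame was changed or was not
pageable is not covered above, so the sketch simply reports that case.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict


class Action(Enum):
    REFRESH_FROM_DASD = auto()  # page invalidated; next reference pages in the slot copy
    ESCALATE = auto()           # assumption: no clean copy exists, so report upward


@dataclass
class Frame:
    number: int
    changed: bool        # hypothetical: modified since it was last written to its slot
    pageable: bool       # hypothetical: backs a pageable page
    online: bool = True


def recover_frame(frame: Frame, page_table: Dict[int, int]) -> Action:
    """Handle an uncorrectable storage error reported against `frame`.

    `page_table` maps virtual page number -> frame number for valid pages.
    """
    # Take the failing frame offline so it is never allocated again.
    frame.online = False

    if frame.pageable and not frame.changed:
        # The copy in the slot on DASD is still current: invalidate the page
        # so the next reference faults and loads a fresh copy into a good frame.
        for page, fnum in list(page_table.items()):
            if fnum == frame.number:
                del page_table[page]
        return Action.REFRESH_FROM_DASD

    # Changed or non-pageable frame: the only good copy was in the bad frame.
    return Action.ESCALATE
```

The point of the sketch is the test in the middle: recovery is transparent
only when an unchanged copy of the page still exists on DASD.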

Yes, but this was implemented at IBM so darned early, in the 1970s. I
documented it (and other things) in an IBM internal report about 'S370 Machine
Checks in MVS', circa 1974, and then plagiarized myself by putting it into the
very first 'MVS Diagnostic Techniques' manual (a manual, not a Redbook or
sales flyer), circa 1976.

Yes, old news at IBM.

If even old mainframe veterans are not aware of this stuff, how can we expect
'the kids' to know about it? Old, old news at IBM, but there is a fresh
audience. And they NEED to hear it. We NEED to tell them!

Don't we? What do you think?

Dan

-----Original Message-----
>From: John Gilmore <[email protected]>
>Sent: Jan 8, 2014 8:31 AM
>To: [email protected]
>Subject: Re: Hardware failures (was Re: Scary Sysprogs ...)
>
>Anecdotage is, I suppose, innocuous; but it would be helpful to make
>some distinctions, in particular one between hardware failures and
>system failures.
>
>Hardware failures that are recovered from are moderately frequent, as
>everyone who has had occasion to look at SYS1.LOGREC outputs
>presumably knows.
>
>The merit of z/OS and its predecessors is that most such failures are
>recovered from without system loss.  The system continues to be
>available and to do useful work.  The hardware is indeed very
>reliable, but the machinery for detecting and recovering from hardware
>[and some software] errors makes an equally important contribution to
>system availability.
>
>John Gilmore, Ashland, MA 01721 - USA
>
>----------------------------------------------------------------------
>For IBM-MAIN subscribe / signoff / archive access instructions,
>send email to [email protected] with the message: INFO IBM-MAIN


Thank you,

Dan 

Dan Skwire
home phone 941-378-2383
cell phone 941-400-7632
office phone 941-227-6612
primary email: [email protected]
secondary email: [email protected]

