Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread John McKown
Back in the z890 days, we had a CPU fail. Of course, the hardware
automatically recovered and we only knew about it due to a logrec record
being written and a message on the HMC. We also had one of our OSAs fail.
The second OSA did an ARP takeover (proper term?) and we suffered _no_ user
interruption. The LAN people _refused_ to believe that the OSA could fail
that way without disrupting all the IP sessions of the users on that OSA.
Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that
NIC die immediately (well, they time out).
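
(For anyone curious how such a takeover is typically done: the surviving adapter
broadcasts a gratuitous ARP for the moved IP address, so neighbors remap that IP
to the new MAC. A rough sketch of such an announcement using the scapy library --
the addresses and interface name are made up, and this is not the OSA
implementation:)

from scapy.all import ARP, Ether, sendp

TAKEN_OVER_IP = "192.0.2.10"         # IP address moving to the surviving adapter
NEW_MAC       = "02:00:00:00:00:02"  # surviving adapter's MAC

announce = Ether(dst="ff:ff:ff:ff:ff:ff", src=NEW_MAC) / ARP(
    op=2,                            # ARP reply ("is-at")
    hwsrc=NEW_MAC, psrc=TAKEN_OVER_IP,
    hwdst="ff:ff:ff:ff:ff:ff", pdst=TAKEN_OVER_IP)

sendp(announce, iface="eth0")        # neighbors update their ARP caches to the new MAC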


On Wed, Jan 8, 2014 at 12:52 AM, Elardus Engelbrecht 
elardus.engelbre...@sita.co.za wrote:

 Scott Ford wrote:

 Like Joel, I haven't seen a hardware failure in the z/OS world since the
 70s.

 Lucky you. I wrote on IBM-MAIN in May/June 2013 about Channel Errors which
 caused SMF damage amongst other problems.

 These channel errors were caused by bad optic cables and some
 directors/routers.

 Then last year, during a hardware upgrade, those projects were delayed
 because an IBM hardware component blew and a part had to be flown in from
 somewhere...

 Hardware failures can still happen these days.

 Groete / Greetings
 Elardus Engelbrecht

-- 
Wasn't there something about a PASCAL programmer knowing the value of
everything and the Wirth of nothing?

Maranatha! 
John McKown



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread John Gilmore
Anecdotage is, I suppose, innocuous; but it would be helpful to make
some distinctions, in particular one between hardware failures and
system failures.

Hardware failures that are recovered from are moderately frequent, as
everyone who has had occasion to look at SYS1.LOGREC outputs
presumably knows.

The merit of z/OS and its predecessors is that most such failures are
recovered from without system loss.  The system continues to be
available and to do useful work.  The hardware is indeed very
reliable, but the machinery for detecting and recovering from hardware
[and some software] errors makes an equally important contribution to
system availability.

John Gilmore, Ashland, MA 01721 - USA



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Shmuel Metz (Seymour J.)
In
caajsdjjarmx2jqj0awzwf1gnptnx_5eysazwro83e2ewsc5...@mail.gmail.com,
on 01/08/2014
   at 07:16 AM, John McKown john.archie.mck...@gmail.com said:

The LAN people _refused_ to believe that the OSA could fail that way
without disrupting all the IP sessions of the users on that OSA.

That's typical, alas.
 
-- 
 Shmuel (Seymour J.) Metz, SysProg and JOAT
 ISO position; see http://patriot.net/~shmuel/resume/brief.html 
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread David Crayford
On 08/01/2014, at 9:16 PM, John McKown john.archie.mck...@gmail.com wrote:

 The LAN people _refused_ to believe that the OSA could fail
 that way without disrupting all the IP sessions of the users on that OSA.
 Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that
 NIC die immediately (well, they time out).

Doesn't NIC teaming solve that problem on distributed?



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Dan Skwire
Software recovery of hardware errors?

It is quite an opportunity area for IBM to re-publicize, especially for these 
new kids on the block. I recently had an amazing discussion with someone I 
respect highly at Oracle (formerly Sun Microsystems), a very knowledgeable 
veteran system architect, who was so proud that Solaris had implemented 
hardware processor storage page-frame recovery in software. 

You know the idea: take a 'parity' or 'checking-block code' failure (multiple-bit 
errors, etc.), take the frame offline, and if the frame was unchanged and backing 
a pageable page, invalidate the page so that a fresh copy gets loaded back in from 
'the slot' on DASD. Great stuff. Congratulations, Solaris! 
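
A self-contained toy sketch of that recovery decision, in Python, with made-up 
names -- not the actual MVS or Solaris code, just the logic described above:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    pageable: bool
    changed: bool              # modified since it was last paged in?

@dataclass
class Frame:
    backing: Optional[Page]    # the virtual page this real frame holds, if any

def handle_frame_error(frame: Frame) -> str:
    # The failing frame is always taken offline so it is never reused.
    # If it held an unchanged, pageable page, simply invalidate the page:
    # the next reference page-faults and a fresh copy is read back in from
    # the page's slot on DASD into a healthy frame. Otherwise the contents
    # are lost and the affected work has to be terminated.
    page = frame.backing
    if page is not None and page.pageable and not page.changed:
        return "frame offline; page invalidated; clean copy reloads from DASD"
    return "frame offline; contents lost; terminate the affected work"

print(handle_frame_error(Frame(Page(pageable=True, changed=False))))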

Yes, but this was implemented at IBM so darned early, in the 1970s. I 
documented it (and other things) in an IBM internal report on 'S/370 Machine 
Checks in MVS', circa 1974, and then plagiarized myself by putting it into the 
very first 'MVS Diagnostic Techniques' manual (a manual, not a Redbook or sales 
flyer), circa 1976.

Yes, old news at IBM.

If old mainframe veterans are not so aware of this stuff, then can we expect 
'the kids' to know about it? Old, old news at IBM, but there is a fresh 
audience. And they NEED to hear it.
We NEED to tell them!

Don't we? What do you think?

Dan


Thank you,

Dan 

Dan Skwire
home phone 941-378-2383
cell phone 941-400-7632
office phone 941-227-6612
primary email: dskw...@mindspring.com
secondary email: dskw...@gmail.com



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Anne Lynn Wheeler
re:
http://www.garlic.com/~lynn/2014.html#23 Scary Sysprogs and educating those 
'kids'
http://www.garlic.com/~lynn/2014.html#24 Scary Sysprogs and educating those 
'kids'

after transferring to San Jose Research ... I was allowed to wander
around other locations in the area. One of the places was the disk
engineering and development labs. At the time, they had a fair number of
IBM mainframes (they would get one of the earliest engineering mainframe
processors ... usually #3 or #4 ... for starting disk testing ... aka
needing to test engineering disks ... but also needing to test
engineering mainframes with the latest disks). At the time the machine
rooms were running all the mainframes 7x24, around the clock, on
stand-alone testing schedules. At one time, they had tried to use MVS to
provide an operating system environment and be able to do multiple
concurrent tests ... however, in that environment MVS had a 15min MTBF.

I offered to rewrite the I/O supervisor to make it bulletproof and
never fail ... enabling on-demand, anytime, concurrent testing
... significantly improving disk development productivity. After that I
would get sucked into diagnosing lots of development activity
... because frequently, anytime there was any kind of problem ... they
would accuse the software ... and I would get a call ... and have to
figure out what the hardware problem was. old posts about getting
to play disk engineer in bldgs 14&15
http://www.garlic.com/~lynn/subtopic.html#disk

I did a write-up of what was necessary to support the environment and
happened to make reference to the MVS 15min MTBF ... which brought down
the wrath of the MVS RAS group on my head ... they would have gotten me
fired if they could have figured out how (I had tried to work with them
to improve MVS RAS ... but instead they turned it into an adversarial
situation).

A couple of years later ... when 3380s were starting to ship ... the MVS
system was hanging/failing (requiring re-IPL) in the FE 3380 error
regression tests (typical errors expected to be found in customer
installations) ... and in the majority of cases, there was not even an
indication of what was responsible for the failure (of course I had to be
handling them all along ... since nearly all development was being done
under systems I provided). old email from the period discussing MVS
failures with the FE 3380 error regression test:
http://www.garlic.com/~lynn/2007.html#email801015

3380s had been announced 11June1980
http://www-03.ibm.com/ibm/history/exhibits/storage/storage_3380.html

-- 
virtualization experience starting Jan1968, online at home since Mar1970



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Anne Lynn Wheeler
john.archie.mck...@gmail.com (John McKown) writes:
 Back in the z890 days, we had a CPU fail. Of course, the hardware
 automatically recovered and we only knew about it due to a logrec record
 being written and a message on the HMC. We also had one of our OSAs fail.
 The second OSA did an ARP takeover (proper term?) and we suffered _no_ user
 interruption. The LAN people _refused_ to believe that the OSA could fail
 that way without disrupting all the IP sessions of the users on that OSA.
 Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that
 NIC die immediately (well, they time out).

re:
http://www.garlic.com/~lynn/2014.html#23 Scary Sysprogs and educating those 
'kids'
http://www.garlic.com/~lynn/2014.html#24 Scary Sysprogs and educating those 
'kids'
http://www.garlic.com/~lynn/2014.html#27 Hardware failures (was Re: Scary 
Sysprogs ...)

we did IP-address take-over (ARP cache times out and remaps ip-address
to a different MAC address) in HA/CMP
http://www.garlic.com/~lynn/subtopic.html#hacmp

however at the time, most vendors used bsd reno/tahoe 4.3 software for
their tcp/ip stack ... and there was a bug in the 4.3 code (and
therefore in nearly every machine out there).

the bug was in the ip layer ... it saved the previous response from its
call to the ARP cache ... and if the next IP operation was for the same
ip-address, it used the saved value (bypassing the ARP cache handler).
The ARP cache protocol requires that the saved ip-address/mac-address
mapping in the ARP cache time out, so that a new ARP operation has to be
done to rediscover the corresponding MAC address (for that ip-address).
However, the copy saved in the ip layer had no such time-out.
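
A toy Python sketch of that buggy fast path (made-up names, not the actual
4.3 code): the IP layer's one-entry cache never expires, while the ARP cache
underneath it does:

import time

ARP_TIMEOUT = 20 * 60            # seconds; illustrative value only

arp_cache = {}                   # ip -> (mac, time_resolved)
saved = None                     # the IP layer's remembered (ip, mac) -- the bug

def send_arp_request(ip):
    return "02:00:00:00:00:01"   # stand-in for the real broadcast ARP request

def arp_resolve(ip):
    """Honest ARP cache: entries expire and get re-resolved."""
    entry = arp_cache.get(ip)
    if entry is None or time.time() - entry[1] > ARP_TIMEOUT:
        arp_cache[ip] = (send_arp_request(ip), time.time())
    return arp_cache[ip][0]

def ip_output_buggy(ip):
    """Buggy IP layer: reuses its last answer forever for the same ip."""
    global saved
    if saved and saved[0] == ip:
        return saved[1]          # never times out -> keeps using a stale MAC
    saved = (ip, arp_resolve(ip))
    return saved[1]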

In a strongly client/server-oriented environment, where the client sends
the majority of its tcp/ip traffic to the same server (ip-address) ... it
could go for long periods of time w/o changing ip-addresses. As a result,
a server doing ip-address takeover to a different LAN/MAC address
wouldn't be noticed by such clients. We had to come up with all sorts of
hacks to smear ip-address traffic across the environment ... trying to
force clients to reset their ip-address to mac-address mapping.

There is a separate gimmick which involves MAC-address spoofing ... i.e.
in theory every MAC address is unique, created at manufacturing time
... however some number of adapters have been given the ability to soft
reset their MAC address (so if one adapter fails ... another adapter can
spoof the failed adapter).

-- 
virtualization experience starting Jan1968, online at home since Mar1970



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Joel C. Ewing
If you read my original remarks more closely, you would see I did not
say I had seen no IBM mainframe hardware failures since the 70's, but
that I had not seen any  UNDETECTED hardware failures.  If the hardware
reported no hardware problems, you could pretty well rule that out as a
cause of application failure - it had to be a software issue.

As others have already remarked, many detected mainframe hardware
failures in recent decades resulted in no outages or only minimal
disruption because of redundancy in the hardware and z/OS recovery.  But
I have also seen a few classic cases where the hardware's built-in
diagnostics had much difficulty directing proper repairs because the root
cause was something that wasn't expected to break (a bad I/O cage in a z9
rather than bad cards, intermittent errors from a loose bolt connecting
large power bus strips that didn't show up until six months after a
water-cooled processor upgrade).

The only hardware issues that occurred with any regularity were with tape
drives and line printers, and these tended to be either media problems or
obvious mechanical issues.  Several single-drive failures per year were
also common in our RAID-5 DASD subsystems, but these were always
non-disruptive to the data and to z/OS.
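
(Why a single failed drive is non-disruptive, as a toy Python sketch -- not our
DASD subsystem's actual implementation: any lost strip can be rebuilt by
XOR-ing the surviving strips with the parity strip.)

def raid5_parity(strips):
    """Parity strip = byte-wise XOR of all data strips."""
    parity = bytes(len(strips[0]))
    for s in strips:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return parity

def rebuild_lost_strip(surviving_strips, parity):
    """XOR of the survivors and the parity recreates the missing strip."""
    return raid5_parity(surviving_strips + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]          # strips on three data drives
p = raid5_parity(data)                      # strip on the parity drive
assert rebuild_lost_strip([data[0], data[2]], p) == data[1]   # drive 2 failed
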
Joel C. Ewing


-- 
Joel C. Ewing, Bentonville, AR   jcew...@acm.org 



Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread R.S.

On 2014-01-08 17:23, Joel C. Ewing wrote:

If you read my original remarks more closely, you would see I did not
say I had seen no IBM mainframe hardware failures since the 70's, but
that I had not seen any  UNDETECTED hardware failures.  If the hardware
reported no hardware problems, you could pretty well rule that out as a
cause of application failure - it had to be a software issue.


OK, I have seen undetected HW failures on a mainframe.
Example: IBM RAMAC RVA. A disk module was in '?' status -- something 
between OK and NOT OK. The machine reported no bad disks, but the disk was 
no longer OK and no longer in use.
Another example: an ESCON port. Granted, I detected it because the attached 
CU did not work, but the root cause wasn't reported.


Another one, quite recent (it's a microcode issue actually): an LPAR cannot 
be IPLed after a z/OS shutdown and Reset Clear. Circumvention: don't use 
Reset Clear, or re-activate the LPAR.


I believe I can dig in my memory deeper...

--
Radoslaw Skorupka
Lodz, Poland








Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Ed Finnell
IIRC the 360/50's didn't have parity checking on the CPU bus. Long story 
short, a CE told me in the early 80's that CE overtime dropped 50% with the 
intro of the 370s and another 50% when the 303x's were withdrawn.
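
(As a toy aside -- nothing to do with the actual 360/50 circuitry -- this is
all a parity bit buys you: one extra bit per word so that any single flipped
bit is detected. A minimal Python sketch:)

def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1-bits."""
    return bin(word).count("1") % 2

def check(word: int, stored_parity: int) -> bool:
    return parity_bit(word) == stored_parity

w = 0b10110100
p = parity_bit(w)
assert check(w, p)                    # clean transfer passes
assert not check(w ^ 0b00000100, p)   # a single flipped bit is caught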
 
 
In a message dated 1/8/2014 10:31:01 A.M. Central Standard Time,
r.skoru...@bremultibank.com.pl writes:

I believe I can dig in my memory deeper...




Re: Hardware failures (was Re: Scary Sysprogs ...)

2014-01-08 Thread Anne Lynn Wheeler
efinnel...@aol.com (Ed Finnell) writes:
 IIRC the 360/50's didn't have parity checking on the CPU bus. Long story 
 short, a CE told me in the early 80's that CE overtime dropped 50% with the 
 intro of the 370s and another 50% when the 303x's were withdrawn.

re:
http://www.garlic.com/~lynn/2014.html#23 Scary Sysprogs and educating those 
'kids'
http://www.garlic.com/~lynn/2014.html#24 Scary Sysprogs and educating those 
'kids'
http://www.garlic.com/~lynn/2014.html#27 Hardware failures (was Re: Scary 
Sysprogs ...)
http://www.garlic.com/~lynn/2014.html#29 Hardware failures (was Re: Scary 
Sysprogs ...)

303x's were mostly 370s. they took the integrated channel microcode from
the 370/158 and created the 303x channel director (158 microcode engine
with just the integrated channel microcode and w/o the 370 microcode).

a 3031 was a 370/158 engine with the 370 microcode (and w/o the
integrated channel microcode) and a 2nd (channel director) 370/158 engine
with the integrated channel microcode (and w/o the 370 microcode).

a 3032 was a 370/168 reconfigured to work with the channel director.

a 3033 started out as 370/168 logic mapped to 20% faster chips ...
some other optimization eventually got it up to about 50% faster than
the 168.

CEs had a machine diagnostic service process that required being able to
scope. The 3081 had chips packaged inside TCMs (thermal conduction
modules) and couldn't be scoped. To support the CE service process, the
TCMs had a bunch of probes connected to a service processor. CEs then had
a (bootstrap) diagnostic service process that could diagnose/scope a
failing service processor ... and then use a working service processor
to diagnose the rest of the machine. TCM
http://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV2137.html
and
http://en.wikipedia.org/wiki/Thermal_Conduction_Module#Mainframes_and_supercomputers

other comments about 3033 & 3081 ... being part of the q&d effort
to get machines back into the product pipeline after the failure
of the Future System effort:
http://www.jfsowa.com/computer/memo125.htm

the 3090 started out with a 4331 running a highly modified version of
release 6 vm370/cms as the service processor (with all the menu screens
done in cms ios3270). This was upgraded to a pair of 4361s (with probes
into the TCMs for diagnosing problems). reference to the 3092 (service
controller) needing a pair of 3370 fixed-block architecture disks
... i.e. the system disks for the vm/4361s (aka even for a pure MVS
installation ... where MVS never had any 3370/FBA support)
http://www-03.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP3090.html

more ... although the following says 3090 in 1984 ... the 3090 wasn't
announced until feb 1985 (see above):
http://en.wikipedia.org/wiki/Thermal_Conduction_Module#Mainframes_and_supercomputers

old email mentioning 3092
http://www.garlic.com/~lynn/2010e.html#email861031
http://www.garlic.com/~lynn/2010e.html#email861223

-- 
virtualization experience starting Jan1968, online at home since Mar1970



Hardware failures (was Re: Scary Sysprogs ...)

2014-01-07 Thread Elardus Engelbrecht
Scott Ford wrote:

Like Joel, I haven't seen a hardware failure in the z/OS world since the 70s. 

Lucky you. I wrote on IBM-MAIN in May/June 2013 about Channel Errors which 
caused SMF damage amongst other problems.

These channel errors were caused by bad optic cables and some directors/routers.

Then last year, during a hardware upgrade, those projects were delayed because 
an IBM hardware component blew and a part had to be flown in from somewhere...

Hardware failures can still happen these days.

Groete / Greetings
Elardus Engelbrecht
