If you read my original remarks more closely, you would see I did not
say I had seen no IBM mainframe hardware failures since the 70's, but
that I had not seen any UNDETECTED hardware failures.  If the hardware
reported no problems, you could pretty well rule hardware out as the
cause of an application failure - it had to be a software issue.

As others have already remarked, many detected mainframe hardware
failures in recent decades resulted in no outage, or only minimal
disruption, because of hardware redundancy and z/OS recovery.  But I
have also seen a few classic cases where the built-in hardware
diagnostics had great difficulty directing proper repairs because the
root cause was something that wasn't expected to break (a bad I/O cage
in a z9 rather than bad cards, or intermittent errors from a loose bolt
connecting large power bus strips that didn't show up until six months
after a water-cooled processor upgrade).

The only hardware issues that occurred with any regularity were with
tape drives and line printers, and these tended to be either media
problems or obvious mechanical issues.  Several single-drive failures
per year were also common in our RAID-5 DASD subsystems, but these were
always non-disruptive to the data and to z/OS.
        Joel C. Ewing

On 01/08/2014 07:16 AM, John McKown wrote:
> Back in the z890 days, we had a CPU fail. Of course, the hardware
> automatically recovered and we only knew about it due to a logrec record
> being written and a message on the HMC. We also had one of our OSAs fail.
> The second OSA did an ARP takeover (proper term?) and we suffered _no_ user
> interruption. The LAN people _refused_ to believe that the OSA could fail
> that way without disrupting all the IP sessions of the users on that OSA.
> Apparently when a multi-NIC PC has a NIC fail, all the IP sessions on that
> NIC die immediately (well, they time out).
> 
> 
> On Wed, Jan 8, 2014 at 12:52 AM, Elardus Engelbrecht <
> [email protected]> wrote:
> 
>> Scott Ford wrote:
>>
>>> Like Joel, I haven't seen a hardware failure in the Z/OS world since the
>> 70s.
>>
>> Lucky you. I wrote on IBM-MAIN in May/June 2013 about channel errors which
>> caused SMF damage, amongst other problems.
>>
>> These channel errors were caused by bad optic cables and some
>> directors/routers.
>>
>> Then last year, during a hardware upgrade, those projects were delayed
>> because an IBM hardware component blew and a part had to be flown in from
>> somewhere...
>>
>> Hardware failures can still happen these days.
>>
>> Groete / Greetings
>> Elardus Engelbrecht
>>



-- 
Joel C. Ewing,    Bentonville, AR       [email protected] 
