Hi Steve,
You said: "...availability manger ..." Were there 3 wise men too?

You said: "... (cohabitating the raised floor) ..." What was the couple doing on the raised floor?

Regards, David

On 2025-03-09 19:07, Steve Estle wrote:
Hi all,

I have so many stories I could share - mainframe and beyond since in essence I 
worked for years as an availability manger (think critsit manager).  Some human 
caused, some not...  Here are two that I'll share - seems this would make a great 
presentation (Guinness World Records on greatest outages :-):

1. Early in my career, after hiring on at IBM in Tucson straight out of college in the early 80's.  One day I 
was working and all of a sudden our internal IBM system (my recollection is it was a 3081 - of course 
water-cooled back then) went dead - our operations was right up the hallway.  I quickly ran out to the 
operations team (cohabitating the raised floor) and asked what caused our main system to go down.  To my 
amazement they pointed out to the raised floor - I then looked out on the floor and became aghast.  Our 
chilled water (who designed this?) ran thru the ceiling of the data center.  The cause of the outage was a 
main chilled water pipe bursting in the ceiling, and lo and behold it had doused our 3081 processor 
completely.  Of course back then no DR solution in place...  The poor CE team that inherited this 
"natural" disaster had to go thru and one by one remove each and every card in the 3081 machine, 
dry the card out with a high tech "blow dryer", then reinsert the card and run diagnostics on it.  
In the meantime they basically ordered replacements for nearly every card in the machine because of course 
there was no telling how many would come back "alive".  Think we were down for nearly two days 
working this one out.   Moral - never mix water and electronics - only bad ensues!

2. My second one is not mainframe related - it involves a distributed environment and occurred 
many years later as part of my fun in the Availability mgr role.  Our team in Boulder supported 
customers in a "shared" SAN - storage area network (think Cloud-like storage-on-demand 
environment).  The outage stemmed from a very simple mistake - the SAN administrator was 
performing a relatively straightforward change to add LUNs (storage) for a new server.  To do 
this you of course have to update the zoning configuration in both of the dual SANs - we'll call 
them SAN A & B to keep it simple.  The SAN admin was at the end of his shift but wanted to 
get this change done prior to leaving for the day.  So the SAN admin pulled down the zoning 
configurations from SAN A & B and made the needed updates per specifications.  The 
administrator then proceeded to activate the change via the following steps:

    A. Took the updated zoning configuration for SAN A & proceeded to activate 
it in SAN B.
    B. Took the updated zoning configuration for SAN B & proceeded to activate 
it in SAN A.
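A simple safeguard would have caught the swap: tag each exported zoning configuration with 
the fabric it came from, and refuse activation if the tag doesn't match the target. Here's a 
minimal sketch of that idea - all names (ZoningConfig, activate) are hypothetical, not any 
SAN vendor's actual API:

```python
# Hypothetical sketch: block activating SAN A's zoning config on SAN B.
from dataclasses import dataclass


@dataclass
class ZoningConfig:
    fabric_id: str  # which fabric this config was exported from, e.g. "SAN_A"
    zones: dict     # zone name -> list of member WWPNs


def activate(target_fabric: str, config: ZoningConfig) -> str:
    """Refuse to push a zoning config onto the wrong fabric."""
    if config.fabric_id != target_fabric:
        raise ValueError(
            f"config was exported from {config.fabric_id}, "
            f"refusing to activate on {target_fabric}"
        )
    # ...vendor-specific activation would happen here...
    return f"activated {len(config.zones)} zones on {target_fabric}"


cfg_a = ZoningConfig("SAN_A", {"zone1": ["wwpn1", "wwpn2"]})
print(activate("SAN_A", cfg_a))  # correct target: succeeds
try:
    activate("SAN_B", cfg_a)     # the Black Thursday mistake
except ValueError as err:
    print("blocked:", err)
```

End-of-shift fatigue defeats checklists; a mechanical guard like this doesn't get tired.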

Instantly all servers (100+, including virtual machines) lost access to their storage 
and my "pager" went off.  Long story short, fixing the zoning change took minutes but 
recovering all the servers / apps took days.  We nicknamed the day "Black Thursday".

Many more in my memory banks but these are two that take the cake :-)

Life in the fast lane of technology :-0

Steve

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email [email protected] with the message: INFO IBM-MAIN
