Hi all, I have so many stories I could share - mainframe and beyond, since in essence I worked for years as an availability manager (think critsit manager). Some human-caused, some not... Here are two that I'll share - seems this would make a great presentation (Guinness World Records for greatest outages :-):
1. Early in my career, after hiring on with IBM in Tucson straight out of college in the early 80's, I was working one day when our internal IBM system (my recollection is a 3081 - of course water-chilled back then) suddenly went dead. Our operations team was right up the hallway, cohabitating the raised floor, so I quickly ran out and asked what had caused our main system to go down. To my amazement they pointed out at the raised floor - I looked out on the floor and was aghast. Our chilled water piping (who designed this?) ran through the ceiling of the data center. The outage stemmed from a main chilled water pipe bursting in the ceiling and, lo and behold, completely dousing our 3081 processor. Of course, back then no DR solution was in place... The poor CE team that inherited this "natural" disaster then had to go through the 3081 one card at a time: remove each and every card, dry it out with a high-tech "blow dryer", reinsert it, and run diagnostics on it. In the meantime they basically ordered replacements for nearly every card in the machine, because of course there was no telling how many would come back "alive". I think we were down for nearly two days sorting this one out. Moral - never mix water and electronics; only bad ensues!

2. My second one is not mainframe related - it involves a distributed environment, many years later, as part of my fun in the Availability manager role. Our team in Boulder supported customers in a "shared" SAN - storage area network (think a Cloud-like storage-on-demand environment). The outage stemmed from a very simple mistake: the SAN administrator was performing a relatively straightforward change to add LUNs (storage) for a new server. To do this you of course have to update the zoning configuration in both of the dual SANs - we'll call them SAN A & B to keep it simple.
The SAN admin was at the end of his shift but wanted to get the change done before leaving for the day. So he pulled down the zoning configurations from SAN A & B and made the needed updates per specifications. He then proceeded to activate the change via the following steps:
A. Took the updated zoning configuration for SAN A & activated it in SAN B.
B. Took the updated zoning configuration for SAN B & activated it in SAN A.
Instantly, all servers (100+, including virtual machines) lost access to their storage and my "pager" went off. Long story short, fixing the zoning change took minutes, but the recovery of all the servers / apps took days. We nicknamed the day "Black Thursday".

Many more in my memory banks, but these are two that take the cake :-) Life in the fast lane of technology :-0

Steve
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send email to [email protected] with the message: INFO IBM-MAIN
