Hi Steve,
You said: "...availability manger ..." Were there 3 wise men too?
You said: "... (cohabitating the raised floor) ..." What was the couple
doing on the raised floor?
Regards, David
On 2025-03-09 19:07, Steve Estle wrote:
Hi all,
I have so many stories I could share - mainframe and beyond, since in essence I
worked for years as an availability manger (think critsit manager). Some human-caused,
some not... Here are two that I'll share - seems like this would make a great
presentation (Guinness World Records for greatest outages :-) :
1. Early in my career, after hiring on with IBM in Tucson straight out of college in the early '80s, I was
working one day on our internal IBM system (my recollection is it was a 3081 - of course water-chilled back
then) when all of a sudden the system went dead. Our operations team was right up the hallway, so I quickly
ran out to them (cohabitating the raised floor) and asked what had caused our main system to go down. To my
amazement they pointed out at the raised floor - I looked out over the floor and was aghast. Our chilled
water lines (who designed this?) ran through the ceiling of the data center. The outage stemmed from a main
chilled water pipe bursting in the ceiling, which, lo and behold, had doused our 3081 processor completely.
Of course, back then there was no DR solution in place... The poor CE team that inherited this "natural"
disaster then had to go through and, one by one, remove each and every card in the 3081, dry the card out
with a high-tech "blow dryer," reinsert it, and run diagnostics on it. In the meantime they ordered
replacements for nearly every card in the machine, because of course there was no telling how many would
come back "alive." I think we were down for nearly two days working through this one. Moral - never mix
water and electronics - only bad ensues!
2. My second one is not mainframe related - it involves a distributed environment and occurred
many years later as part of my fun in the availability manager role. Our team in Boulder supported
customers in a "shared" SAN - storage area network (think cloud-like storage-on-demand
environment). The outage stemmed from a very simple mistake - the SAN administrator was
performing a relatively straightforward change to add LUNs (storage) for a new server. To do
this you of course have to update the zoning configuration in both of the dual SANs - we'll call
them SAN A & B to keep it simple. The SAN admin was at the end of his shift but wanted to
get the change done prior to leaving for the day. So he pulled down the zoning
configurations from SAN A & B and made the needed updates per specifications. The
administrator then proceeded to activate the change via the following steps:
A. Took the updated zoning configuration for SAN A & proceeded to activate
it in SAN B.
B. Took the updated zoning configuration for SAN B & proceeded to activate
it in SAN A.
Instantly all servers (100+, including virtual machines) lost access to their storage,
and my "pager" went off. Long story short, fixing the zoning change took minutes, but
recovering all the servers and apps took days. We nicknamed that day "Black Thursday".
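For what it's worth, here is a minimal sketch of the kind of pre-activation guard that would have
caught the crossed configs. It's purely illustrative Python - the ZoningConfig/Fabric names and the
activate() function are hypothetical, not any SAN vendor's actual zoning API:

    from dataclasses import dataclass, field

    @dataclass
    class ZoningConfig:
        # Hypothetical container for a fabric's zoning configuration (illustrative only)
        fabric_name: str    # fabric this config was pulled from and edited for, e.g. "SAN_A"
        zones: dict = field(default_factory=dict)   # zone name -> set of member WWPNs

    @dataclass
    class Fabric:
        # Hypothetical handle for the fabric we are about to activate on
        name: str           # e.g. "SAN_B"

    def activate(config: ZoningConfig, target: Fabric) -> None:
        # Guard: refuse to push a config onto a fabric it wasn't edited for.
        if config.fabric_name != target.name:
            raise ValueError(f"Refusing to activate config for {config.fabric_name} "
                             f"on fabric {target.name} - check the change plan.")
        # ...vendor-specific activation call would go here...
        print(f"Activated {len(config.zones)} zones on {target.name}")

    # The crossed pair from the story (config for SAN A pushed to fabric B) gets rejected:
    san_a_cfg = ZoningConfig(fabric_name="SAN_A", zones={"zone1": {"wwpn_host", "wwpn_lun"}})
    try:
        activate(san_a_cfg, Fabric(name="SAN_B"))
    except ValueError as err:
        print(err)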
Many more in my memory banks but these are two that take the cake :-)
Life in the fast lane of technology :-0
Steve
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email [email protected] with the message: INFO IBM-MAIN
----------------------------------------------------------------------