Hi all, I have so many stories I could share - mainframe and beyond, since in essence I worked for years as an availability manager (think critsit manager). Some human-caused, some not... Here are two that I'll share - seems this would make a great presentation (Guinness World Records for greatest outages :-):
1. Early in my career, after hiring on with IBM in Tucson straight out of college in the early 80's, I was working one day when our internal IBM system (my recollection is a 3081 - of course water-chilled back then) suddenly went dead. Our operations team was right up the hallway, cohabitating the raised floor, so I quickly ran out and asked what had caused our main system to go down. To my amazement they pointed out at the raised floor - I looked out on the floor and was aghast. Our chilled water piping (who designed this?) ran through the ceiling of the data center. The outage stemmed from a main chilled water pipe bursting in the ceiling and, lo and behold, completely dousing our 3081 processor. Of course, back then no DR solution was in place... The poor CE team that inherited this "natural" disaster then had to go through the 3081 one card at a time: remove each and every card, dry it out with a high-tech "blow dryer", reinsert it, and run diagnostics on it. In the meantime they basically ordered replacements for nearly every card in the machine, because of course there was no telling how many would come back "alive". I think we were down for nearly two days sorting this one out. Moral - never mix water and electronics; only bad ensues!

2. My second one is not mainframe related - it involves a distributed environment, many years later, as part of my fun in the Availability manager role. Our team in Boulder supported customers in a "shared" SAN - storage area network (think a Cloud-like storage-on-demand environment). The outage stemmed from a very simple mistake: the SAN administrator was performing a relatively straightforward change to add LUNs (storage) for a new server. To do this you of course have to update the zoning configuration in both of the dual SANs - we'll call them SAN A & B to keep it simple.
The SAN admin was at the end of his shift but wanted to get the change done before leaving for the day. So he pulled down the zoning configurations from SAN A & B and made the needed updates per specifications. He then proceeded to activate the change via the following steps:
A. Took the updated zoning configuration for SAN A & activated it in SAN B.
B. Took the updated zoning configuration for SAN B & activated it in SAN A.
Instantly, all servers (100+, including virtual machines) lost access to their storage and my "pager" went off. Long story short, fixing the zoning change took minutes, but the recovery of all the servers / apps took days. We nicknamed the day "Black Thursday".

Many more in my memory banks, but these are two that take the cake :-) Life in the fast lane of technology :-0

Steve
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send email to [email protected] with the message: INFO IBM-MAIN
