While YMMV, our experience has been that any utility power failure
lasting more than 5-15 seconds is a hard failure: the outage will
invariably last an hour or more while the utility company locates and
fixes the problem. This means that unless you have an extraordinary
UPS, or working generators able to recharge the UPS, you are going
down -- the only question is when and how. Given the choice, a controlled
shutdown from which restart is almost guaranteed is infinitely better
than gambling on hours of added downtime and potential data loss from
an abrupt termination, all for the questionable benefit of staying up
a few minutes longer.
One of the issues sounds like a management problem. Placing the power to
make a shutdown decision solely in the hands of a "duty manager" who is
not 100% available obviously doesn't work for decisions requiring a
10-minute-or-better response time. The other issue is that you must have
automation support in place to minimize the z/OS shutdown time, and
documented emergency shutdown procedures that are required reading for
whoever may have to effect the shutdown.
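The value of that automation is elapsed time: independent subsystems can be
stopped concurrently, with shared resource managers taken down only after
their users are gone. A minimal sketch of that idea follows; the subsystem
names and the stop() stand-in are hypothetical, not the Netview/NETSTOP
tooling described later in this post.

```python
# Illustrative sketch only: the real procedure uses Netview automation and
# the CBT NETINIT/NETSTOP programs; subsystem names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor

# Phase 1: subsystems with no dependencies on each other, stopped in parallel
PHASE_1 = ["CICSPROD", "CICSTEST", "IMSMSG", "BATCH_INITIATORS"]
# Phase 2: shared resource managers, stopped only after their users are gone
PHASE_2 = ["DB2PROD", "JES2"]

issued = []  # record of commands "issued", in completion order

def stop(subsystem):
    """Stand-in for issuing an operator STOP command to a subsystem."""
    issued.append(f"STOP {subsystem}")

def emergency_shutdown():
    # Stopping independent subsystems concurrently minimizes elapsed
    # shutdown time -- the point of automating the procedure.
    with ThreadPoolExecutor() as pool:
        list(pool.map(stop, PHASE_1))   # waits for all phase-1 stops
    for subsystem in PHASE_2:           # dependent managers last, in order
        stop(subsystem)

emergency_shutdown()
```

The same two-phase structure applies whatever the automation vehicle: the
ordering constraint (resource managers last) is what the procedure documents.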
We have documented procedures for emergency system and hardware shutdown
and z/OS automated procedures (using Netview automation and CBT freebie
program NETINIT/NETSTOP) to take down online and batch systems and DB2
as quickly as possible. These are the same procedures used for normal
IPL shutdown, so they are tested regularly. Normally Operations would
consult with whoever is on call in Technical Services (and someone is
always available) and we would advise whether to initiate a system
shutdown or do it ourselves if on site; but if communication is
impossible within the allowed time frame, that decision must be made by
the ranking Operator on site.
Our procedures also document a quick and dirty shutdown method if there
is reason to believe the remaining UPS time is at best only one or two
minutes instead of the typical 15+ minutes - namely, "QUIESCE" z/OS,
"SYSTEM RESET" the production LPAR, and power down the processor and
other hardware ASAP. There is greater risk of logical damage - DB2
threads in a questionable state and possibly a need to recover some
specific tables from archive logs -- but doing a controlled hardware
shutdown should at least eliminate any hardware issues on restart.
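The escalation decision above can be sketched as a simple threshold check on
the estimated remaining battery time. The numbers and step names below are
illustrative assumptions, not the poster's documented procedure:

```python
# Hypothetical sketch of the escalation decision; thresholds and step
# names are illustrative, not an actual site procedure.
CONTROLLED = [
    "drain CICS and IMS message queues",
    "stop and roll back BMPs and DB2 online work",
    "drain batch initiators",
    "switch and archive logs",
    "shut down z/OS and power off hardware",
]
QUICK_AND_DIRTY = [
    "QUIESCE z/OS",
    "SYSTEM RESET the production LPAR",
    "power down processor and other hardware",
]

def shutdown_plan(battery_minutes, controlled_needs=10):
    """Pick the controlled sequence if the UPS estimate allows it;
    otherwise fall back to the quick-and-dirty hardware stop."""
    if battery_minutes >= controlled_needs:
        return CONTROLLED
    return QUICK_AND_DIRTY

print(shutdown_plan(15)[0])  # typical 15+ minute estimate: controlled path
print(shutdown_plan(2)[0])   # only a minute or two left: QUIESCE and reset
```

The quick path trades logical cleanliness (in-doubt DB2 threads, possible
table recovery from archive logs) for hardware safety, exactly as described
above.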
Joel C Ewing
Kelly Bert Manning wrote:
Please don't laugh.
I work with applications on a supported, non-sysplex, non-XRF z/OS
system where there have been 3 cases of UPS batteries draining flat,
followed by uncontrolled server crashes, in the past 17 years.
They all happened in October and November, gale season. (Cue background
music with the "Gales of November" line by Gordon Lightfoot.)
After the first one the data center operator said that they would consider
giving operators authority to shut down OS/390 if they were unable to
make immediate contact with the "Duty Manager" after discovering that
UPS batteries were draining during a power failure and that generator
power was not available or failed after starting.
Four weeks later a carbon-copy crash occurred, inspiring a promise that
operators would start draining CICS and IMS message queues and stopping
and rolling back BMPs and DB2 online jobs while there was still power
in the batteries.
Roll forward to this decade: power goes off during gale season, the
generators start, but one fails and goes offline, followed by other
mayhem in the power hardware. Back on batteries for 22 minutes, until
they drain and the z server crashes. The current operator says, "What
promise to shut everything down cleanly before the batteries drain?"
Is 22 minutes an unreasonable amount of time for purging IMS message
queues, bringing down CICS regions, draining initiators, abending and
rolling back online IMS and DB2 jobs to the last checkpoint, switching
logs, writing and dismounting log backups, and turning off power before
sudden power loss starts to wreak havoc on disk and other hardware?
Oh, did I mention: the single 2-CPU processor was only about 30% busy at
the time, during the Sunday weekly low-CPU-use period.
We had a different sort of power outage after the first of the 2 crashes
last decade. Somebody working for one of the potential bidders used
a metal tape measure in an attempt to measure clearance around the
power cable entrance to the building. The resulting demonstration of
how much power moves through the space around a high voltage cable
destroyed several 3380 clone drives, in addition to crashing all
the OS/390 processors. I earned my DBA pay that day.
Bottom line, what should happen when UPS batteries start to drain and
there is no prospect of reliable, high quality, utility power being
restored quickly? Leave it up and roll the dice about losing work
in progress and log data (head crashes and cache controller microcode
bugs) or shut it down cleanly?
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html