While YMMV, our experience has been that any utility power failure
lasting more than 5-15 seconds is a hard failure: the outage will
invariably last an hour or more while the utility company locates and
fixes the problem. This means that unless you have an extraordinary
UPS, or working generators able to recharge the UPS, you are going
down -- the only question is when and how. Given the choice, a controlled
shutdown from which restart is almost guaranteed is infinitely better
than gambling on hours of added downtime and potential data loss from
an abrupt termination, all for the questionable benefit of staying up
a few minutes longer.
One of the issues sounds like a management problem. Placing the power to
make a shutdown decision solely in the hands of a "duty manager" who is
not 100% available obviously doesn't work for decisions requiring a
10-minute-or-better response time. The other issue is that you must have
automation support in place to minimize the z/OS shutdown time, and
documented emergency shutdown procedures that are required reading for
whoever may have to effect the shutdown.
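The value of that automation is elapsed time: independent subsystems can be
stopped concurrently, with shared resource managers taken down only after
their users are gone. A minimal sketch of that idea follows; the subsystem
names and the stop() stand-in are hypothetical, not the Netview/NETSTOP
tooling described later in this post.

```python
# Illustrative sketch only: the real procedure uses Netview automation and
# the CBT NETINIT/NETSTOP programs; subsystem names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor

# Phase 1: subsystems with no dependencies on each other, stopped in parallel
PHASE_1 = ["CICSPROD", "CICSTEST", "IMSMSG", "BATCH_INITIATORS"]
# Phase 2: shared resource managers, stopped only after their users are gone
PHASE_2 = ["DB2PROD", "JES2"]

issued = []  # record of commands "issued", in completion order

def stop(subsystem):
    """Stand-in for issuing an operator STOP command to a subsystem."""
    issued.append(f"STOP {subsystem}")

def emergency_shutdown():
    # Stopping independent subsystems concurrently minimizes elapsed
    # shutdown time -- the point of automating the procedure.
    with ThreadPoolExecutor() as pool:
        list(pool.map(stop, PHASE_1))   # waits for all phase-1 stops
    for subsystem in PHASE_2:           # dependent managers last, in order
        stop(subsystem)

emergency_shutdown()
```

The same two-phase structure applies whatever the automation vehicle: the
ordering constraint (resource managers last) is what the procedure documents.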
We have documented procedures for emergency system and hardware shutdown
and z/OS automated procedures (using Netview automation and CBT freebie
program NETINIT/NETSTOP) to take down online and batch systems and DB2
as quickly as possible. These are the same procedures used for normal
IPL shutdown, so they are tested regularly. Normally Operations would
consult with whoever is on call in Technical Services (and someone is
always available) and we would advise whether to initiate a system
shutdown or do it ourselves if on site; but if communication is
impossible within the allowed time frame, that decision must be made by
the ranking Operator on site.
Our procedures also document a quick and dirty shutdown method if there
is reason to believe the remaining UPS time is at best only one or two
minutes instead of the typical 15+ minutes - namely, "QUIESCE" z/OS,
"SYSTEM RESET" the production LPAR, and power down the processor and
other hardware ASAP. There is greater risk of logical damage - DB2
threads in a questionable state and possibly a need to recover some
specific tables from archive logs -- but doing a controlled hardware
shutdown should at least eliminate any hardware issues on restart.
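The escalation decision above can be sketched as a simple threshold check on
the estimated remaining battery time. The numbers and step names below are
illustrative assumptions, not the poster's documented procedure:

```python
# Hypothetical sketch of the escalation decision; thresholds and step
# names are illustrative, not an actual site procedure.
CONTROLLED = [
    "drain CICS and IMS message queues",
    "stop and roll back BMPs and DB2 online work",
    "drain batch initiators",
    "switch and archive logs",
    "shut down z/OS and power off hardware",
]
QUICK_AND_DIRTY = [
    "QUIESCE z/OS",
    "SYSTEM RESET the production LPAR",
    "power down processor and other hardware",
]

def shutdown_plan(battery_minutes, controlled_needs=10):
    """Pick the controlled sequence if the UPS estimate allows it;
    otherwise fall back to the quick-and-dirty hardware stop."""
    if battery_minutes >= controlled_needs:
        return CONTROLLED
    return QUICK_AND_DIRTY

print(shutdown_plan(15)[0])  # typical 15+ minute estimate: controlled path
print(shutdown_plan(2)[0])   # only a minute or two left: QUIESCE and reset
```

The quick path trades logical cleanliness (in-doubt DB2 threads, possible
table recovery from archive logs) for hardware safety, exactly as described
above.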
Joel C Ewing
Kelly Bert Manning wrote:
Please don't laugh.
I work with applications on a supported, non-sysplex, non-XRF z/OS
system where there have been 3 cases of UPS batteries draining flat,
followed by uncontrolled server crashes, in the past 17 years.
They all happened in October and November, gale season. (Cue background
music with the "Gales of November" line by Gordon Lightfoot.)
After the first one the data center operator said that they would consider
giving operators authority to shut down OS/390 if they were unable to
make immediate contact with the "Duty Manager" after discovering that
UPS batteries were draining during a power failure and that generator
power was not available or failed after starting.
Four weeks later a carbon-copy crash occurred, inspiring a promise that
operators would start draining CICS and IMS message queues and stopping
and rolling back BMPs and DB2 online jobs while there was still power
in the batteries.
Roll forward to this decade: power goes off during gale season, the
generators start, but one fails and goes offline, followed by other
mayhem in the power hardware. Back on batteries for 22 minutes, until
they drain and the z server crashes. The current operator says, "What
promise to shut everything down cleanly before the batteries drain?"
Is 22 minutes an unreasonable amount of time for purging IMS message
queues, bringing down CICS regions, draining initiators, abending and
rolling back online IMS and DB2 jobs to the last checkpoint, switching
logs, writing and dismounting log backups, and turning off power before
sudden power loss starts to wreak havoc on disk and other hardware?
Oh, did I mention: the single 2-CPU processor was only about 30% busy at
the time, during the Sunday weekly low-CPU-use period.
We had a different sort of power outage after the first of the 2 crashes
last decade. Somebody working for one of the potential bidders used
a metal tape measure in an attempt to measure clearance around the
power cable entrance to the building. The resulting demonstration of
how much power moves through the space around a high voltage cable
destroyed several 3380 clone drives, in addition to crashing all
the OS/390 processors. I earned my DBA pay that day.
Bottom line, what should happen when UPS batteries start to drain and
there is no prospect of reliable, high quality, utility power being
restored quickly? Leave it up and roll the dice about losing work
in progress and log data (head crashes and cache controller microcode
bugs) or shut it down cleanly?
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html