On Thu, 12 Sep 2013 15:38:08 -0500, Mark Zelden wrote:

>
>Some suggestions:
>
>1) System Health Checker.
>
>2) Any other threshold monitor (Omegamon, TMON, Sysview, Mainview, etc).
>Health Checker however has the right price if you don't have one of those
>(free!).
>
>3) PFA (Predictive Failure Analysis).

C'mon Mark, don't be so modest - running a script to pull the numbers from the
GDA like your IPLINFO exec does would also be a well-priced alternative ... ;-)

Having also had an outage for critical SQA (ESQA in this case) shortage in the 
past week, I can sympathise with the OP. ESQA spilled to ECSA, machine stopped 
responding to any carbon-based lifeform. GRS apparently kept kicking the RSA 
around the ring, and jobs on the queue initiated, but eventually it had to be 
bounced.
No, I don't have a stand-alone dump, but I did meander through an 878-8 which 
is of course a little late in proceedings. However it did show all the action 
was in ASID 0001 - so I'm thinking recovery routines or long-lived scheduled
(vendor ??? - don't get me started again ...) tasks misbehaving. Too many 
broken/unavailable control blocks to be sure.

As for Mark's suggestions:
HC - didn't hear of any alerts, but will check when I get back to looking at
the logs.
Monitors - again nothing mentioned. Note to self: check why not.
PFA - useless as a lifeboat on a camel on small systems without operlog.

The age-old question of how you recognise loops as mentioned earlier still 
stands - but shared storage can be tested against high-water marks. I'm 
currently looking to get Omegamon to issue SNMP alerts for things like this, 
but it's more convoluted than it should be.
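
For what it's worth, the high-water-mark test can be sketched roughly as
below. This is only an illustration in Python, not anything Omegamon or
IPLINFO actually does: the AreaUsage type, the areas_over_threshold function,
the 80% default and the sample figures are all made up. It assumes the
allocated and defined sizes have already been obtained from somewhere (a
monitor, or a script reading the GDA) and does not touch any z/OS control
blocks itself.

```python
from dataclasses import dataclass


@dataclass
class AreaUsage:
    """One common storage area (hypothetical record layout)."""
    name: str          # e.g. "SQA", "ESQA", "CSA", "ECSA"
    allocated_kb: int  # storage currently allocated in the area
    size_kb: int       # defined size of the area

    @property
    def percent_used(self) -> float:
        # Treat an undefined (zero-size) area as 0% used rather than dividing by zero.
        if self.size_kb == 0:
            return 0.0
        return 100.0 * self.allocated_kb / self.size_kb


def areas_over_threshold(areas, warn_pct=80.0):
    """Return (name, percent used) for each area at or above warn_pct."""
    return [
        (a.name, round(a.percent_used, 1))
        for a in areas
        if a.percent_used >= warn_pct
    ]


if __name__ == "__main__":
    # Invented sample figures, purely for illustration.
    sample = [
        AreaUsage("SQA", 900, 2048),
        AreaUsage("ESQA", 30000, 31000),
        AreaUsage("ECSA", 120000, 400000),
    ]
    for name, pct in areas_over_threshold(sample):
        print(f"WARNING: {name} is {pct}% allocated")
```

Run on a timer, something like this would at least complain before ESQA
starts spilling into ECSA; what to do with the warning (WTO, SNMP trap,
email) is a separate problem.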

Shane ...

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
