On Thu, 12 Sep 2013 15:38:08 -0500, Mark Zelden wrote:

>Some suggestions:
>
>1) System Health Checker.
>
>2) Any other threshold monitor (Omegamon, TMON, Sysview, Mainview, etc). Health Checker
>however has the right price if you don't have one of those (free!).
>
>3) PFA (Predictive Failure Analysis).

C'mon Mark, don't be so modest - running a script to pull the numbers from the GDA like your IPLINFO exec does would also be a well-priced alternative ... ;-)

Having also had an outage caused by a critical SQA (ESQA in this case) shortage in the past week, I can sympathise with the OP. ESQA spilled to ECSA, and the machine stopped responding to any carbon-based lifeform. GRS apparently kept kicking the RSA around the ring, and jobs on the queue initiated, but eventually it had to be bounced.

No, I don't have a stand-alone dump, but I did meander through an 878-8, which is of course a little late in proceedings. However, it did show all the action was in ASID 0001 - so I'm thinking recovery routines or long-lived scheduled (vendor ??? - don't get me started again ...) tasks misbehaving. Too many broken/unavailable control blocks to be sure.

As for Mark's suggestions:
HC - didn't hear of any alerts, but will check when I get back to looking at the logs.
Monitors - again nothing mentioned. Note to self: check why not.
PFA - useless as a lifeboat on a camel on small systems without operlog.

The age-old question of how you recognise loops, as mentioned earlier, still stands - but shared storage can be tested against high-water marks. I'm currently looking to get Omegamon to issue SNMP alerts for things like this, but it's more convoluted than it should be.

Shane ...

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
