APAR OA58438 (was Re: Planned ESQA change and HealthCheck)

Mark Zelden Thu, 03 Oct 2019 14:18:53 -0700

If you are running z/OS 2.3 and increasing ESQA because of expansion
into ECSA messages or sudden unexplained growth, check out APAR OA58438.


We had 3 system crashes after migrations to z/OS 2.3 in 2019 and one
close call after ECSA got to 99% when ESQA expanded into it (only a
vendor monitor crashed in that case after a failed ECSA getmain).
Stand-alone dumps didn't find the root cause, other than we knew it was
RPB pool growth related to SVC dumps from CICS.  In one case a single
SVC dump caused an 80M ESQA spike within one or two seconds, which
crashed a system when it spilled into ECSA and filled that up as well
(ECSA was typically at about 70% use, but "stable").

We worked with IBM all summer on this.  We had different SLIPs and GTF
traces put in place, but with the traces going the problem never
happened.  But SVC dump processing did take over the CPU with the
trace + GTF active!   :-)

Meanwhile, we increased ESQA on 30 LPARs via normal IPLs over the summer
by about 80M, and ECSA a bit, as a "work around".  These are settings
that hadn't been touched in god knows how long (certainly not since
64-bit usage has increased and HVCOMMON came along).  So we had to lose
about 100M of high private to do this.  We also increased real storage
on a couple of LPARs that really didn't warrant it (based on zero or
close to zero demand paging during normal operations), but we knew real
storage was also involved in the problem (no flash memory for SVC dumps
on my client's mainframes).

The entire time IBM has said we are the only ones reporting the problem,
but since we had the problem in big sysplexes, small sysplexes, big
LPARs, and small LPARs, I know that we can't be the only ones.  I think
other shops are ignoring the ESQA expansion into ECSA (since that in
itself doesn't hurt) and / or they have more "white space".  The RPB
control blocks are freed after about 10 minutes, so anyone looking at
their current ESQA (and ECSA) usage wouldn't notice the spikes or would
just say "oh well, looks good now".
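
To illustrate why a point-in-time look misses this, here is a minimal
Python sketch.  It assumes you already have periodic ESQA usage samples
from some monitor; the Sample record, the helper names, and every number
in it are made up for illustration (only the 80M jump and the roughly
10 minute release mirror what is described above).  It is not taken from
any real monitor or IBM interface.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int      # minutes since monitoring started (hypothetical)
    esqa_mb: float   # ESQA in use, in MB (hypothetical figures)


def high_water_mark(samples):
    """Return the sample with the highest ESQA usage seen so far."""
    return max(samples, key=lambda s: s.esqa_mb)


def spikes(samples, threshold_mb):
    """Return samples whose usage rose by more than threshold_mb
    compared with the sample immediately before them."""
    return [
        cur
        for prev, cur in zip(samples, samples[1:])
        if cur.esqa_mb - prev.esqa_mb > threshold_mb
    ]


# Made-up series: an 80 MB jump at minute 2, released about 10 minutes later.
samples = [
    Sample(0, 40.0),
    Sample(1, 41.0),
    Sample(2, 121.0),
    Sample(12, 41.5),
    Sample(13, 41.0),
]

latest = samples[-1]
peak = high_water_mark(samples)
jumps = spikes(samples, threshold_mb=50.0)

print(f"current usage: {latest.esqa_mb} MB")
print(f"high-water mark: {peak.esqa_mb} MB at minute {peak.minute}")
print(f"spike minutes: {[s.minute for s in jumps]}")
```

The current reading looks fine, while the high-water mark and the
sample-to-sample jump show the spike.  The sampling interval has to be
shorter than the time the storage stays allocated or the spike is missed
entirely.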

Anyway, IBM had been getting close to figuring this out, partially
re-created the problem in the lab some weeks ago, and just got back to
us today with the root cause and the APAR that was opened.  It is
related to being real storage constrained at the time of the SVC dumps
(I think all of the crashes were during CICS startup time in the wee
morning hours).

I really wanted to post something about this earlier but didn't since
IBM said they had no other reported problems.  So if you have seen this
problem since migrating to z/OS 2.3, now you know you aren't the only
ones.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
