Hi Greg,
>A FORCE command becomes a CALLRTM TYPE=MEMTERM request which will drive
>address space level resource manager routines in the *MASTER* address
>space for the memterm'd address space. Ideally a resource manager
>routine should never need, or at least wait for, serialization of a
>resource during its processing. But the ideal is rarely the reality,
>and resource manager routines can, and often do, get delayed or
>deadlocked.
I am guessing that this is what happened here. When I wrote up the management
summary after the analysis (and after I had posted here), it looked as if the
initiator *did* terminate, but only about 20 minutes later, after it had caused
problems in production (the original latch contention was on an AD system):
D GRS,CONTENTION,ENQ
ISG343I 11.30.00 GRS STATUS 419
S=SYSTEM  SYSZRACF SYS1.RACF.BACK
SYSNAME  JOBNAME   ASID  TCBADDR   EXC/SHR    STATUS
AWE2     CTSACS    010F  009C68F8  EXCLUSIVE  OWN
AWE2     MIMGR     00CF  008E2398  EXCLUSIVE  WAIT
S=SYSTEM  SYSZRAC2 SYS1.RACF
SYSNAME  JOBNAME   ASID  TCBADDR   EXC/SHR    STATUS
AWE2     INIT      0040  009FF6C8  EXCLUSIVE  OWN
AWE2     TCPTNS01  00DC  009CAE88  SHARE      WAIT
AWE2     xxxx      0030  00919838  SHARE      WAIT
AWE2     INIT      00B2  009FF6C8  SHARE      WAIT
S=SYSTEM  SYSZRAC2 SYS1.RACF.BACK
SYSNAME  JOBNAME   ASID  TCBADDR   EXC/SHR    STATUS
AWE2     CTSACS    010F  009C68F8  EXCLUSIVE  OWN
AWE2     INIT      0040  009FF6C8  EXCLUSIVE  WAIT
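
Reading the display as a wait-for graph makes the ENQ chain easier to see. Here
is a minimal Python sketch of that idea. The rows are transcribed by hand from
the display above, and the function names (wait_for, root_blockers) are made up
for illustration; this is not an IBM tool or API.

```python
from collections import defaultdict

# Rows transcribed from the D GRS,CONTENTION,ENQ display above:
# (resource, jobname, asid, status)
ROWS = [
    ("SYSZRACF SYS1.RACF.BACK", "CTSACS",   "010F", "OWN"),
    ("SYSZRACF SYS1.RACF.BACK", "MIMGR",    "00CF", "WAIT"),
    ("SYSZRAC2 SYS1.RACF",      "INIT",     "0040", "OWN"),
    ("SYSZRAC2 SYS1.RACF",      "TCPTNS01", "00DC", "WAIT"),
    ("SYSZRAC2 SYS1.RACF",      "xxxx",     "0030", "WAIT"),
    ("SYSZRAC2 SYS1.RACF",      "INIT",     "00B2", "WAIT"),
    ("SYSZRAC2 SYS1.RACF.BACK", "CTSACS",   "010F", "OWN"),
    ("SYSZRAC2 SYS1.RACF.BACK", "INIT",     "0040", "WAIT"),
]


def wait_for(rows):
    """Map each waiter (jobname, asid) to the set of owners it is waiting on."""
    owners = defaultdict(set)
    for resource, job, asid, status in rows:
        if status == "OWN":
            owners[resource].add((job, asid))
    graph = defaultdict(set)
    for resource, job, asid, status in rows:
        if status == "WAIT":
            graph[(job, asid)] |= owners[resource]
    return graph


def root_blockers(rows):
    """Owners that are not themselves waiting on anything in the display."""
    graph = wait_for(rows)
    all_owners = {owner for held_by in graph.values() for owner in held_by}
    return sorted(owner for owner in all_owners if owner not in graph)


if __name__ == "__main__":
    for waiter, held_by in sorted(wait_for(ROWS).items()):
        print(waiter, "waits on", sorted(held_by))
    print("root blockers:", root_blockers(ROWS))
```

On these rows the only owner that is not itself waiting is CTSACS (ASID 010F):
INIT 0040 owns SYS1.RACF but waits on CTSACS for SYS1.RACF.BACK, and everyone
else queues up behind INIT 0040. It only covers ENQs, of course, not latches.
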
After I had forced CTSACS out of the system, the dump I had taken (of the GRS
address space only) finally came through and got written out. And I found
IEF403I INIT - STARTED - TIME=11.44.01 for the initiator in question right
after the force completed. It is quite possible that the contention in
production started to resolve itself at this point, but we all were rather
frazzled and had already decided to reIPL the system.
I am guessing that CTSACS was caught up in the Unix latch contention we had in
parallel, which does not show up on any D GRS,C or D GRS,LATCH,C display. (That
address space uses USS because of its IP connections to workstations.) I also
cannot find latch contention information in my dump, although there definitely
was some at that point.
>RTM *does* monitor resource manager routine execution (Paul and I added
>this support many years ago), expecting the task it is running under to
>have executed something across an interval of time. If two intervals
>pass without any execution occurring then RTM will ABTERM the task with
>an ABEND 30D.
>
>It sounds like an address space level resource manager routine became
>hung, possibly due to the same latching issue you issued the FORCE for.
>I suggest you check in LOGREC for any ABEND 30D records during the
>interval to see if a hung resource manager routine was detected by RTM,
>and go from there.
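
As I read the description above, the check boils down to a two-interval
watchdog. The following is only a toy Python model of that logic to make the
"two intervals" rule concrete. The class name, the progress counter and the
once-per-interval call are all my assumptions; it says nothing about how RTM
actually implements this.

```python
class IntervalWatchdog:
    """Toy model: flag a task after two consecutive intervals without progress."""

    LIMIT = 2  # "two intervals", taken from the quoted description

    def __init__(self):
        self.last_seen = 0        # progress value seen at the previous check
        self.idle_intervals = 0   # consecutive intervals with no change

    def check(self, progress_counter):
        """Call once per interval; returns True when the task looks hung."""
        if progress_counter != self.last_seen:
            # The task did something since the last check, so start over.
            self.last_seen = progress_counter
            self.idle_intervals = 0
            return False
        self.idle_intervals += 1
        return self.idle_intervals >= self.LIMIT


if __name__ == "__main__":
    dog = IntervalWatchdog()
    for counter in (5, 5, 5):
        print(counter, dog.check(counter))
```

In this model the third call with an unchanged counter is the one that reports
a hang, which is where the real thing would issue the ABEND 30D.
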
No ABEND 30D in LOGREC, and aside from my dump of GRS there is no other dump,
either. Traditionally, this installation has never had SADUMP configured, or I
would have taken one. (I plan to fix that this year!)
Interestingly enough, ASCBEWST from CTSACS in my dump shows a timestamp of
11:08. At 11:04 we had the first IGW038A Possible PDSE Problem(s) (SMSPDSE1)
and at 11:05 we had the first BPXM123E UNIX SYSTEM SERVICES HAS DETECTED THAT A
GRS LATCH HAS BEEN HELD BY JOB xxxxxx FOR AN EXTENDED PERIOD OF TIME.
And yes, we definitely had PDSE Latch contention, and *something* was very
wrong. Note that there is no data set name in the display:
V SMS,PDSE1,ANALYSIS
IGW031I PDSE ANALYSIS Start of Report(SMSPDSE1) 214
---------data set name---------------------- -----vsgt-------
                                             01-ASPIL0-00011C
++ Unable to latch DIB:000000007E9BE0A0
   Latch:000000007F7D5F08 Holder(0040:009C7248)
   Holding Job:AUTO39P
   CallingSequence:IGWLHJIN IGWDACND IGWDBPAR IGWFARR3
++ Unable to latch LSDLTDW:000000007F6BF680
   Latch:000000007F6BF798 Holder(0040:009C7248)
   Holding Job:AUTO39P
++ Lock GLOBAL DIRECTORY SHARED
   held for at least 842 seconds
   Hl1b:7F7D5DB0 HOLDER(0040:009C7248)
   Holding Job:AUTO39P
++ Lock GLOBAL FORMATWRITE SHARED
   held for at least 842 seconds
   Hl1b:7F7D5DB0 HOLDER(0040:009C7248)
   Holding Job:AUTO39P
There was no ENQ contention when we first had this latch contention. It looks a
bit like OA51460, except that the address space in question never got canceled.
There was an ABEND 913 earlier in PDSE processing showing the same calling
sequence as mentioned in OA51460, but that was in another address space.
I have no clue if or how MIM handles Latches.
Thanks for the thoughts!
Barbara
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN