Hi Greg,

>A FORCE command becomes a CALLRTM TYPE=MEMTERM request which will drive
>address space level resource manager routines in the *MASTER* address
>space for the memterm'd address space.  Ideally a resource manager
>routine should never need, or at least wait for, serialization of a
>resource during its processing.  But the ideal is rarely the reality,
>and resource manager routines can, and often do, get delayed or
>deadlocked.

I am guessing that this is what happened here. When I wrote the management 
summary after the analysis (after I had posted here), it appeared that the 
initiator *did* terminate, but only about 20 minutes later, after it had caused 
problems in production (the original latch contention was on an AD system):
D GRS,CONTENTION,ENQ                                                
ISG343I 11.30.00 GRS STATUS 419                                     
S=SYSTEM  SYSZRACF SYS1.RACF.BACK                                   
SYSNAME        JOBNAME         ASID     TCBADDR   EXC/SHR    STATUS 
AWE2      CTSACS             010F       009C68F8 EXCLUSIVE    OWN   
AWE2      MIMGR              00CF       008E2398 EXCLUSIVE    WAIT  
S=SYSTEM  SYSZRAC2 SYS1.RACF                                        
SYSNAME        JOBNAME         ASID     TCBADDR   EXC/SHR    STATUS 
AWE2      INIT               0040       009FF6C8 EXCLUSIVE    OWN
AWE2      TCPTNS01           00DC       009CAE88   SHARE      WAIT  
AWE2      xxxx               0030       00919838   SHARE      WAIT  
AWE2      INIT               00B2       009FF6C8   SHARE      WAIT  
S=SYSTEM  SYSZRAC2 SYS1.RACF.BACK                                   
SYSNAME        JOBNAME         ASID     TCBADDR   EXC/SHR    STATUS 
AWE2      CTSACS             010F       009C68F8 EXCLUSIVE    OWN   
AWE2      INIT               0040       009FF6C8 EXCLUSIVE    WAIT  

After I had forced CTSACS out of the system, the dump I had taken (of the GRS 
address space only) finally came through and got written out. And I found 
IEF403I INIT - STARTED - TIME=11.44.01 for the initiator in question right 
after the force completed. It is quite possible that the contention in 
production started to resolve itself at that point, but we were all rather 
frazzled and had already decided to re-IPL the system.
 
I am guessing that CTSACS was caught up in the parallel UNIX latch contention 
we had, which does not show up on any D GRS,C or D GRS,LATCH,C display. (That 
address space uses USS because of IP connections to workstations.) I also 
cannot find latch contention information in my dump, although there definitely 
was some at that point.

>RTM *does* monitor resource manager routine execution (Paul and I added
>this support many years ago), expecting the task it is running under to
>have executed something across an interval of time.  If two intervals
>pass without any execution occurring then RTM will ABTERM the task with
>an ABEND 30D (
>).
>
>It sounds like an address space level resource manager routine became
>hung, possibly due to the same latching issue you issued the FORCE for.
>I suggest you check in LOGREC for any ABEND 30D records during the
>interval to see if a hung resource manager routine was detected by RTM,
>and go from there.

No ABEND 30D in LOGREC, and aside from my dump of GRS no other dump, either. 
Traditionally, this installation has never had stand-alone dump (SADMP) 
configured, or I would have taken one. (I plan to fix that this year!)

Interestingly enough, ASCBEWST from CTSACS in my dump shows a timestamp of 
11:08. At 11:04 we had the first IGW038A Possible PDSE Problem(s) (SMSPDSE1) 
and at 11:05 we had the first BPXM123E UNIX SYSTEM SERVICES HAS DETECTED THAT A 
GRS LATCH HAS BEEN HELD BY JOB xxxxxx FOR AN EXTENDED PERIOD OF TIME.

And yes, we definitely had PDSE latch contention, and *something* was very 
wrong. Note that there is no data set name in the display:
V SMS,PDSE1,ANALYSIS
IGW031I PDSE ANALYSIS  Start of Report(SMSPDSE1) 214          
---------data set name---------------------- -----vsgt------- 
                                                  01-ASPIL0-00011C
++ Unable to latch DIB:000000007E9BE0A0
   Latch:000000007F7D5F08 Holder(0040:009C7248)
   Holding Job:AUTO39P
   CallingSequence:IGWLHJIN IGWDACND IGWDBPAR IGWFARR3
++ Unable to latch LSDLTDW:000000007F6BF680
   Latch:000000007F6BF798 Holder(0040:009C7248)
   Holding Job:AUTO39P
++ Lock GLOBAL DIRECTORY SHARED
    held for at least 842 seconds
    Hl1b:7F7D5DB0 HOLDER(0040:009C7248)
    Holding Job:AUTO39P
++ Lock GLOBAL FORMATWRITE SHARED
    held for at least 842 seconds
    Hl1b:7F7D5DB0 HOLDER(0040:009C7248)
    Holding Job:AUTO39P

There was no ENQ contention when we first had this latch contention. It looks a 
bit like OA51460, except that the address space in question never got canceled. 
There was an ABEND 913 earlier in PDSE processing showing the same calling 
sequence as mentioned in OA51460, but that was in another address space.
I have no clue whether or how MIM handles latches.

Thanks for the thoughts!

Barbara

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN