Re: how to lose a sysplex in 30 seconds

Bill Neiman Mon, 28 Nov 2005 05:25:58 -0800

On Sun, 27 Nov 2005 14:30:16 +0200, Gil Peleg <[EMAIL PROTECTED]> wrote:


> We have 2 LPARs in a sysplex, running on 2 different machines in 2
>different sites.
>What happened was we lost connectivity between our 2 sites for a few
>seconds.
>As a result, MVSB (running in site B) lost its connectivity to the
>primary SYSPLEX couple data set residing on dasd in site A, and issued the
>following message:
>IXC253I PRIMARY COUPLE DATA SET 953
>XCF.COUPLE01 FOR SYSPLEX
>IS BEING REMOVED BECAUSE OF AN I/O ERROR
>DETECTED BY SYSTEM MVSB
>ERROR CASE: PERMANENT ERROR
>
>The above message was then issued by MVSA as well.
>Sadly enough, our alternate SYSPLEX couple data set resides on dasd in
>site B.
>So MVSA had no connectivity to it, which led to a Disabled Wait 0A2 RC 20
>in MVSA.
>
>After that, MVSB issued the following message:
>IXC256A REMOVAL OF PRIMARY COUPLE DATA SET 463
>XCF.COUPLE01 FOR SYSPLEX
>CANNOT COMPLETE UNTIL
>THE FOLLOWING SYSTEM(S) ACKNOWLEDGE THE REMOVAL:
>MVSA
>
>Of course, MVSA could never acknowledge since it was in a disabled wait.
>
>IXC256A rolled off the MVSB console (which was in DEL=R mode), so by the
>time I got to the console I couldn't see it and didn't know it was issued.
>At MVSB's console, I issued a D R,R and didn't see anything.
>After I saw why MVSA entered the wait, I issued D XCF,C at MVSB's console
>and never got a response.
>Eventually we IPLed both MVSB and MVSA because it seemed like MVSB was
>hung...
>
>I realize there were many mistakes made along the way here. My question
>is, how could I know that IXC256A was issued if it rolled off the console
>(TSO/E was hung too)?? If I had known it was issued, I would have issued a
>V XCF,MVSA,OFFLINE,FORCE and let MVSB complete its couple data set switch...
>
>Also, I don't understand the logic here. MVSA had access to the primary,
>but not to the alternate. MVSB had access to the alternate, but not to the
>primary. Still, MVSA entered a disabled wait and MVSB stayed up, hung until
>MVSA cleanup...
>
>The same exact thing happened on our 2nd sysplex. The 2nd sysplex consists
>of 4 LPARs, 2 in site A and 2 in site B. On this sysplex the 2 systems on
>site A entered a disabled wait 0A2 and the other 2 on site B stayed hung
>waiting for their cleanup...
>
>In either case, I ended up with half a sysplex in a disabled wait and half
>hung. Which got me thinking... what if there were 7 systems on site A and
>only 1 system on site B?? Would z/OS logic still be to put 7 systems into
>a disabled wait instead of only the 1 system that lost access to the
>primary???
>
>Basically you can say we learned the true value of SFM. Had we been using
>it, it would probably have prevented the hang in MVSB, because it would
>have cleaned up the mess left by MVSA after it entered the disabled wait.
>Would SFM also help in the 7-1 case??

Gil,

     When any system detects a permanent I/O error during an attempt to
access a couple data set, it initiates removal of that CDS from service.
The removal protocol involves notifying all other systems of the error by
XCF signal, which causes each of the other systems to remove the CDS from
service as well.  Although you say you lost connectivity between your
sites, it must have been the case that signalling connectivity still
existed between them.  Otherwise, MVSA could not have reacted to the loss
of the primary sysplex CDS detected by MVSB.  The existence of signalling
connectivity created a race condition, in which MVSA and MVSB were
competing to detect and report the loss of access to the CDS at their
respective sites.  MVSB won the race, detecting and signalling the loss of
the primary CDS before MVSA detected loss of the alternate.  MVSA got
MVSB's signal, initiated removal of the primary, and then detected the
inaccessibility of the alternate.  In that situation, with only one CDS
remaining, MVSA enters a wait state but does not signal loss of the
remaining CDS, in the hope that its access problem is only a local issue
(which it was).  MVSB therefore remained alive, because it was still able
to use the alternate CDS.

     The CDS removal protocol requires that each system acknowledge the
removal signals sent by each other system.  MVSA apparently died before
acknowledging one of MVSB's signals, so MVSB was unable to complete
removal of the primary CDS.  Hence the IXC256A message.  I'm not sure why
a D R,R failed to display the outstanding message, since IXC256A is issued
with descriptor code 11.  Our usual recommendation is that the
installation (1) maintain a console defined with DEL(RD) and routecode
and level attributes that collect action and eventual action messages,
and/or (2) automate IXC256A.
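
     As a rough illustration of the automation option, the sketch below
scans captured console or log text for IXC256A and collects the system
names listed after the "ACKNOWLEDGE THE REMOVAL:" line.  It is only a
sketch under assumptions: the message layout is taken from the copy quoted
above, the function name systems_awaited and the message-ID pattern are
made up for the example, and a real installation would more likely do this
in its automation product than in Python.

```python
import re

# Assumption: a new message starts with an ID such as IXC253I or IEE123I.
MSG_ID = re.compile(r"^[A-Z]{3,5}\d{3,4}[A-Z]\b")

def systems_awaited(log_lines):
    """Return the system names that IXC256A says must acknowledge removal."""
    awaited = []
    in_msg = False    # currently inside an IXC256A message
    in_list = False   # past the "ACKNOWLEDGE THE REMOVAL:" line
    for raw in log_lines:
        line = raw.strip()
        if line.startswith("IXC256A"):
            in_msg, in_list = True, False
            continue
        if not in_msg:
            continue
        if not line or MSG_ID.match(line):
            # A blank line or the next message ends the IXC256A block.
            in_msg = in_list = False
            continue
        if in_list:
            awaited.extend(line.split())
        elif line.endswith("ACKNOWLEDGE THE REMOVAL:"):
            in_list = True
    return awaited
```

     Fed the IXC256A text quoted above, this returns a list containing
MVSA.  What to do with that list (alert an operator, open a ticket) is an
installation decision.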

     In the 7-1 case, the same race condition would exist.  If the "1"
system detected and signalled the loss of one CDS before any of the "7"
systems detected and signalled the loss of the other, you'd wind up with 7
systems down and 1 up but hung waiting for the resolution of IXC256A.
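
     The race and its outcome can be sketched as a toy model.  This is
not XCF code, just a simplified restatement of the behavior described
above: whichever system reports first causes every system to remove that
CDS, and any system that cannot reach the one CDS left wait-states without
signalling.  The names A1..An and B1..Bn, and the functions build and
race, are invented for the example; site A systems are assumed to reach
only the primary and site B systems only the alternate.

```python
from dataclasses import dataclass, field

ALL_CDS = frozenset({"PRIMARY", "ALTERNATE"})

@dataclass
class System:
    name: str
    reachable: frozenset            # CDSs this system can still do I/O to
    in_service: set = field(default_factory=lambda: set(ALL_CDS))
    waiting: bool = False           # True once the system has wait-stated

    def remove(self, cds):
        """Take cds out of service, then check what is left."""
        self.in_service.discard(cds)
        # No usable CDS left: wait-state, and do NOT signal the loss.
        if not (self.in_service & self.reachable):
            self.waiting = True

def build(n_site_a, n_site_b):
    """Site A reaches only PRIMARY, site B only ALTERNATE (assumed split)."""
    a = [System(f"A{i}", frozenset({"PRIMARY"})) for i in range(1, n_site_a + 1)]
    b = [System(f"B{i}", frozenset({"ALTERNATE"})) for i in range(1, n_site_b + 1)]
    return a + b

def race(systems, winner_name):
    """winner_name reports first; return (systems up, systems never acking)."""
    winner = next(s for s in systems if s.name == winner_name)
    lost = next(iter(ALL_CDS - winner.reachable))   # the CDS the winner lost
    for s in systems:
        s.remove(lost)              # the winner's signal reaches everyone
    up = [s.name for s in systems if not s.waiting]
    no_ack = [s.name for s in systems if s.waiting]
    return up, no_ack
```

     Calling race(build(7, 1), "B1") leaves B1 up and the seven site A
systems in the second list, which is the set IXC256A would be waiting on.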

     To resolve IXC256A in this situation, it is necessary to partition
the (wait-stated) systems named in it out of the sysplex.  Since a
permanent error involving the sysplex CDS is in progress, this would
require the FORCE form of the V XCF command (V XCF,sysname,OFF,FORCE).
This response is documented with IXC256A.
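
     For an operator aid, the command text can be generated from the
names in IXC256A.  The sketch below only builds the strings for display;
it does not issue anything, and it should be used only for systems already
confirmed to be in a wait state.  The function name and the name-validity
pattern (1 to 8 characters from A-Z, 0-9, @, # and $) are assumptions made
for the example.

```python
import re

# Assumption: system names are 1-8 characters from A-Z, 0-9, @, # and $.
SYSNAME = re.compile(r"[A-Z0-9@#$]{1,8}")

def force_offline_commands(sysnames):
    """Build the V XCF,sysname,OFF,FORCE text for each named system."""
    commands = []
    for name in sysnames:
        name = name.strip().upper()
        if not SYSNAME.fullmatch(name):
            raise ValueError(f"not a plausible system name: {name!r}")
        commands.append(f"V XCF,{name},OFF,FORCE")
    return commands
```

     For the case in this thread, force_offline_commands(["MVSA"]) yields
a single line of command text for the operator to review and enter.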

     Bill Neiman
     z/OS Development

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
