Hi all,
We had a small incident here last week and I wanted to hear your take on
it.

We have 2 LPARs in a sysplex, running on 2 different machines in 2
different sites.
What happened was that we lost connectivity between our 2 sites for a few
seconds.
As a result, MVSB (running in site B) lost its connectivity to the
primary SYSPLEX couple data set residing on DASD in site A, and issued the
following message:
IXC253I PRIMARY COUPLE DATA SET 953
XCF.COUPLE01 FOR SYSPLEX
IS BEING REMOVED BECAUSE OF AN I/O ERROR
DETECTED BY SYSTEM MVSB
ERROR CASE: PERMANENT ERROR

The above message was then issued by MVSA as well.
Sadly enough, our alternate SYSPLEX couple data set resides on DASD in site
B.
So MVSA had no connectivity to it, which led to a disabled wait 0A2 RC 20 on
MVSA.

After that, MVSB issued the following message:
IXC256A REMOVAL OF PRIMARY COUPLE DATA SET 463
XCF.COUPLE01 FOR SYSPLEX
CANNOT COMPLETE UNTIL
THE FOLLOWING SYSTEM(S) ACKNOWLEDGE THE REMOVAL:
MVSA

Of course, MVSA could never acknowledge since it was in a disabled wait.

IXC256A rolled off the MVSB console (which was in DEL=R mode), so by the
time I got to the console I couldn't see it and didn't know it had been
issued.
At MVSB's console, I issued a D R,R and didn't see anything.
After I saw why MVSA had entered the wait, I issued D XCF,C at MVSB's
console and never got a response.
Eventually we IPLed both MVSB and MVSA because it seemed like MVSB was
hung.

I realize many mistakes were made along the way here. My question is: how
could I have known that IXC256A was issued once it had rolled off the
console (TSO/E was hung too)? Had I known, I would have issued a V
XCF,MVSA,OFFLINE,FORCE and let MVSB complete its couple data set switch.

Also, I don't understand the logic here. MVSA had access to the primary, but
not to the alternate. MVSB had access to the alternate, but not to the
primary. Still, MVSA entered a disabled wait and MVSB stayed up, hung while
waiting for MVSA's cleanup.
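
To make the situation concrete, here is a small Python sketch of the pattern
we observed. It is only a toy model of the symptoms described above, not
z/OS's actual XCF logic. The names System, can_reach and observed_outcome are
made up for illustration, and the rule that every system must move to the
alternate once one system reports the primary as failed is my assumption
based on what the messages showed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    site: str  # "A" or "B"


def can_reach(system_site, cds_site, link_up):
    # A couple data set is reachable if it is local to the system's site,
    # or if the inter-site link is up.
    return link_up or system_site == cds_site


def observed_outcome(systems, primary_site, alternate_site, link_up):
    """Map each system name to a state, following the pattern we saw."""
    primary_lost = any(
        not can_reach(s.site, primary_site, link_up) for s in systems
    )
    if not primary_lost:
        return {s.name: "running" for s in systems}

    # Assumption (not confirmed z/OS logic): once one system reports the
    # primary as failed, every system has to move to the alternate, and a
    # system that cannot reach the alternate enters a wait state.
    waiting = {
        s.name for s in systems
        if not can_reach(s.site, alternate_site, link_up)
    }
    states = {}
    for s in systems:
        if s.name in waiting:
            states[s.name] = "disabled wait 0A2"
        elif waiting:
            states[s.name] = "hung awaiting acknowledgement"
        else:
            states[s.name] = "running on alternate"
    return states


sysplex = [System("MVSA", "A"), System("MVSB", "B")]
print(observed_outcome(sysplex, primary_site="A", alternate_site="B",
                       link_up=False))
```

With the primary in site A, the alternate in site B and the link down, this
reproduces what we saw: MVSA in a disabled wait and MVSB hung. Whether real
z/OS follows the same rule regardless of how many systems sit on each side
is exactly what I am unsure about.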

The exact same thing happened on our 2nd sysplex. The 2nd sysplex consists
of 4 LPARs, 2 in site A and 2 in site B. On this sysplex the 2 systems in
site A entered a disabled wait 0A2 and the other 2 in site B stayed hung
waiting for their cleanup.

In either case, I ended up with half a sysplex in a disabled wait and half
hung. Which got me thinking: what if there were 7 systems in site A and
only 1 system in site B? Would z/OS logic still be to put 7 systems into
a disabled wait instead of only the 1 system that lost access to the
primary?

Basically you can say we learned the true value of SFM. Had we been using
it, it would probably have prevented the hang on MVSB, because it would
have cleaned up the mess left by MVSA after it entered the disabled wait.
Would SFM also help in the 7-1 case?

Thanks,
Gil.
