Hi all,

We had a small incident here last week and I wanted to hear your take on it.

We have 2 LPARs in a sysplex, running on 2 different machines in 2 different sites. What happened was that we lost connectivity between our 2 sites for a few seconds. As a result, MVSB (running in site B) lost its connectivity to the primary sysplex couple data set, which resides on DASD in site A, and issued the following message:

  IXC253I PRIMARY COUPLE DATA SET 953
  XCF.COUPLE01 FOR SYSPLEX
  IS BEING REMOVED BECAUSE OF AN I/O ERROR
  DETECTED BY SYSTEM MVSB
  ERROR CASE: PERMANENT ERROR

The same message was then issued by MVSA as well. Sadly enough, our alternate sysplex couple data set resides on DASD in site B, so MVSA had no connectivity to it, which led to a disabled wait 0A2 RC 20 on MVSA. After that, MVSB issued the following message:

  IXC256A REMOVAL OF PRIMARY COUPLE DATA SET 463
  XCF.COUPLE01 FOR SYSPLEX
  CANNOT COMPLETE UNTIL THE FOLLOWING SYSTEM(S)
  ACKNOWLEDGE THE REMOVAL: MVSA

Of course, MVSA could never acknowledge, since it was in a disabled wait. IXC256A rolled off the MVSB console (which was in DEL=R mode), so by the time I got to the console I couldn't see it and didn't know it had been issued. At MVSB's console I issued a D R,R and didn't see anything. After I saw why MVSA had entered the wait, I issued D XCF,C at MVSB's console and never got a response. Eventually we IPLed both MVSB and MVSA, because it seemed that MVSB was hung.

I realize many mistakes were made along the way here. My question is: how could I have known that IXC256A was issued if it had rolled off the console (TSO/E was hung too)? If I had known it was issued, I would have issued a V XCF,MVSA,OFFLINE,FORCE and let MVSB complete its couple data set switch.

Also, I don't understand the logic here. MVSA had access to the primary, but not to the alternate. MVSB had access to the alternate, but not to the primary. Still, MVSA went into a disabled wait and MVSB stayed up, hung until MVSA was cleaned up.

The exact same thing happened on our 2nd sysplex. The 2nd sysplex consists of 4 LPARs, 2 in site A and 2 in site B.
On this sysplex the 2 systems in site A entered a disabled wait 0A2, and the other 2 in site B stayed hung, waiting for them to be cleaned up.

In both cases I ended up with half a sysplex in a disabled wait and half hung. Which got me thinking: what if there were 7 systems in site A and only 1 system in site B? Would the z/OS logic still be to put 7 systems into a disabled wait instead of only the 1 system that lost access to the primary?

Basically, you can say we learned the true value of SFM. Had we been using it, it would probably have prevented the hang on MVSB, because it would have cleaned up the mess left by MVSA after it entered the disabled wait. Would SFM also help in the 7-1 case?

Thanks,
Gil.

