Hello Group,

Last week we had a problem that ended in a full Sysplex outage. After extensive searching, I think it was XCF recovery that drove the Sysplex into the total outage.
We run a two-site Sysplex with PPRC mirroring between the sites, with the Primary DASD and our main production LPARs at SITE1, and with GDPS monitoring the sites. Due to a 'mistake', all FICON connections between the two sites were removed, but the InfiniBand connections remained active. So DASD mirroring was interrupted, but XCF communication kept working. This unexpectedly resulted in all SITE1 LPARs being down and all SITE2 LPARs being full of unrecoverable problems.

After analyzing all logs, we think the following happened:

- After the last connection between the centers was removed, the SITE1 LPARs still had access to the Primary DASD and the SITE2 LPARs lost access to the Primary DASD. InfiniBand and XCF connections were not affected. There were several hangs in applications, e.g. JES2 checkpoint locked, but nothing fatal yet.
- GDPS declared a FREEZE AND GO. This did not change much: SITE1 still had access to the Primary DASD and SITE2 remained disconnected from the Primary DASD.

But then:

- XCF on the SITE2 LPARs detected that it had lost access to the Primary CDSs and switched to the Alternate CDSs.
- XCF on the SITE1 LPARs was notified of the CDS switches, but since these LPARs had no access to the new Primary CDSs, they all loaded wait state 0A2, reason code 010: XCF lost access to all couple data sets.

We ended up in a situation where our main SITE1 LPARs, which still had access to the Primary DASD, were brought down, and the SITE2 LPARs, without access to DASD, were left running. The only thing left to do was to RESET all SITE2 LPARs and re-IPL the SITE1 LPARs.

Evaluating this leads us to conclude that:

- XCF helped the Sysplex into a total outage, because of the SITE2 reaction to the loss of access to the Primary CDSs.
- In SFM we specify that our SITE1 LPARs have a much higher weight than the SITE2 LPARs, but XCF does not perform an SFM analysis; it simply reacts directly to the CDS loss event.
- If the Primary CDSs had been located in SITE2, the opposite would have happened: SITE1 would have lost the Primary CDSs (in SITE2) and switched to the Alternates (in SITE1), which were inaccessible to the SITE2 LPARs, and those would then have loaded wait state 0A2-010. This is the way we would have liked the situation to be resolved.

The GDPS manuals have some statements about the location of Primary and Alternate CDSs, but they do not make a hard recommendation, only a 'logical configuration'. However, in the GDPS CDS configuration panels, the NORMAL configuration defines the Primary CDSs at SITE1 and the Alternates at SITE2. We did not find out how to change this. Furthermore, GDPS Monitor1 checks the CDS orientation, and from HyperSwap tests we know that it complains with GEO2643W if we do not change the CDS orientation after a HyperSwap.

Should it become a CDS configuration requirement to locate the Primary CDSs at the Secondary DASD site and the Alternate CDSs at the Primary DASD site? Did I overlook something?

Regards,
Kees.
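The CDS placement argument above can be checked with a small toy model. This is only a sketch of the reasoning in this post, not of how XCF or GDPS actually work internally. It assumes that, with the inter-site FICON links gone, an LPAR can reach only the DASD at its own site, that any LPAR losing the Primary CDS drives a sysplex-wide switch to the Alternate, and that LPARs unable to reach the new Primary CDS load wait state 0A2-010. The LPAR names (PRD1, PRD2, BKP1) and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lpar:
    name: str   # hypothetical LPAR name
    site: str   # "SITE1" or "SITE2"


def outcome_after_link_loss(lpars, cds_primary_site, cds_alternate_site,
                            dasd_primary_site):
    """Toy model: status of each LPAR after all inter-site DASD links are cut.

    Assumption: an LPAR can only reach DASD (and CDSs) at its own site.
    """
    def reaches(lpar, site):
        return lpar.site == site

    # If any LPAR lost the Primary CDS, the sysplex switches to the Alternate.
    if all(reaches(lpar, cds_primary_site) for lpar in lpars):
        active_cds_site = cds_primary_site
    else:
        active_cds_site = cds_alternate_site

    result = {}
    for lpar in lpars:
        if not reaches(lpar, active_cds_site):
            # No access to the (new) Primary CDS: the system is lost.
            result[lpar.name] = "wait state 0A2-010"
        elif not reaches(lpar, dasd_primary_site):
            # Survives in XCF terms but cannot reach the production DASD.
            result[lpar.name] = "up, no primary DASD"
        else:
            result[lpar.name] = "usable"
    return result


LPARS = [Lpar("PRD1", "SITE1"), Lpar("PRD2", "SITE1"), Lpar("BKP1", "SITE2")]

# Our configuration: Primary CDSs and Primary DASD both at SITE1.
as_configured = outcome_after_link_loss(LPARS, "SITE1", "SITE2", "SITE1")
# Alternative: Primary CDSs at SITE2, Primary DASD still at SITE1.
cds_swapped = outcome_after_link_loss(LPARS, "SITE2", "SITE1", "SITE1")

print(as_configured)
print(cds_swapped)
```

Under these assumptions, the first case leaves no usable LPAR at all (matching what we saw), while the second keeps the SITE1 production LPARs usable and sacrifices only the SITE2 LPARs.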
