I opened an SR against XCF with IBM. The conclusion was that there is no 
simple way to avoid this situation. 

- Each XCF instance reacts individually to each CDS problem.
- In our situation, the errors were detected by the SITE2 XCFs accessing the 
Primary CDSs. But they could as well have been detected by the SITE1 XCFs 
accessing the Alternate CDSs in which case the outcome would have been the 
reverse of ours.
- There is no voting, such as using SFM weights, to steer the XCF actions in 
a desired direction.

In short, this is how it works.
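
To make the three points above concrete, here is a minimal Python sketch of the behaviour as I understand it. It is a toy model, not IBM's logic: the LPAR names and weights are made up, and it rests on the simplifying assumption that the sysplex ends up on whichever CDSs the first-detecting site can still reach, while XCF on every LPAR that cannot reach those CDSs loads the wait state. "ACTIVE" only means XCF kept a usable CDS; it says nothing about whether the LPAR can still reach its production DASD.

```python
from dataclasses import dataclass

WAIT_STATE = "WAIT 0A2-010"  # XCF lost access to all couple data sets


@dataclass(frozen=True)
class Lpar:
    name: str        # hypothetical LPAR name, for illustration only
    site: str        # "SITE1" or "SITE2"
    sfm_weight: int  # recorded only to show that it plays no role below


def xcf_outcome(lpars, detecting_site):
    """Return {LPAR name: XCF state} after the inter-site DASD links are cut.

    Simplifying assumption: the XCFs at `detecting_site` act first, and the
    sysplex keeps whichever CDSs that site can still reach, i.e. its own.
    SFM weights are deliberately ignored, because there is no voting.
    """
    surviving_cds_site = detecting_site
    return {
        lpar.name: "ACTIVE" if lpar.site == surviving_cds_site else WAIT_STATE
        for lpar in lpars
    }


lpars = [
    Lpar("PRD1", "SITE1", sfm_weight=100),
    Lpar("PRD2", "SITE1", sfm_weight=100),
    Lpar("BKP1", "SITE2", sfm_weight=10),
]

# Our case: the SITE2 XCFs noticed the loss of the Primary CDSs first.
print(xcf_outcome(lpars, detecting_site="SITE2"))

# The reverse case: the SITE1 XCFs notice the loss of the Alternate CDSs first.
print(xcf_outcome(lpars, detecting_site="SITE1"))
```

The only input that changes the result is which site detects the problem first. The weights never enter the decision.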

Kees.


> -----Original Message-----
> From: Vernooij, Kees (ITOPT1) - KLM
> Sent: 21 February, 2018 10:02
> To: '[email protected]' <[email protected]>
> Subject: Problem moved to an outage by XCF recovery?
> 
> Hello Group,
> 
> Last week we had a problem which ended in a full Sysplex outage.
> After extensive searching, I think it was XCF recovery that moved the
> Sysplex to the total outage.
> 
> We run a 2 site Sysplex, with PPRC mirroring between the sites, with
> Primary Dasd in SITE1, with our main production LPARs at SITE1 and GDPS
> to monitor the sites.
> 
> Due to a 'mistake' all Ficon connections between the 2 sites were
> removed, but the InfiniBand connections remained active. So DASD
> mirroring was interrupted, but XCF communication remained working.
> This unexpectedly resulted in all SITE1 LPARs going down and all SITE2
> LPARs being left full of unrecoverable problems.
> 
> After analyzing all logs, we think the following happened:
> - After the last connection between the centers was removed, SITE1 LPARS
> still had access to the Primary Dasd and SITE2 LPARs lost access to the
> Primary Dasd. InfiniBand and XCF connections were not hit. There were
> several hangs in applications, e.g. JES2 Checkpoint Locked etc., but
> nothing fatal yet.
> - GDPS declared a FREEZE AND GO. This did not change much, SITE1 still
> had access to the Primary Dasd and SITE2 remained disconnected from the
> Primary Dasd.
> 
> But then:
> - XCF on the SITE2 LPARs detected it lost access to the Primary CDSs and
> moved to the Alternate CDSs.
> - XCF on the SITE1 LPARs was notified of the CDS switch, but since these
> LPARs had no access to the new Primary CDSs, they all loaded a Waitstate
> 0A2, reason code 010: XCF lost access to all couple data sets.
> Now we ended up with a situation, where our main SITE1 LPARs which still
> had access to the Primary Dasd were brought down and the SITE2 LPARs
> without access to Dasd were left over.
> The only thing remaining was to RESET all SITE2 LPARs and re-IPL the
> SITE1 LPARs.
> 
> Evaluating this makes us conclude that:
> - XCF helped the Sysplex into a total outage, because of the SITE2
> reaction to the loss of access to the Primary CDSs.
> - In SFM we specify that our SITE1 LPARs have a much higher weight than
> the SITE2 LPARs, but XCF does not perform an SFM analysis; it simply
> reacts directly to the CDS loss event.
> - If the Primary CDSs were located in SITE2, the opposite would have
> happened: SITE1 lost the Primary CDSs (in SITE2), switched to the
> Alternates (in SITE1), which were inaccessible to SITE2 LPARs and they
> would then load a Waitstate 0A2-010.
> That is how we would have liked the situation to be resolved.
> 
> The GDPS manuals have some statements about the location of Primary and
> Alternate CDSs, but do not make a hard recommendation, only describe a
> 'logical configuration'.
> However, in the GDPS CDS configuration panels, the NORMAL configuration
> defines the Primary CDSs at SITE1 and the Alternates at SITE2. We did
> not find how we can change this.
> Furthermore, GDPS Monitor1 checks the CDS orientation and from Hyperswap
> tests we know that it complains with GEO2643W if we do not change the
> CDS orientation after Hyperswap.
> 
> Should it become a CDS configuration requirement to locate the Primary
> CDSs at the Secondary Dasd site and Alternate CDSs at the Primary Dasd
> site?
> Did I overlook something?
> 
> Regards,
> Kees.
> 
> 
