I ran some tests to see what happens when a CF fails, or when a CF and one
or more systems fail at the same time.

Our test parallel sysplex configuration:

  4 systems (S1 - S4) z/OS 1.6
  2 coupling facilities (CF1 & CF2) CFLEVEL 14
  SFM policy with ISOLATETIME(0) CONNFAIL(NO)
  GRS STAR configuration (ISGLOCK in CF2)

I was told that losing one or more systems and the ISGLOCK structure at the
same time could bring the whole sysplex down if we use the default SFM
weight of 1.
Our ISGLOCK structure has the default REBUILDPERCENT (according to the
documentation it should be 1, but message IXC360I displays it as
"REBUILD PERCENT: N/A").

Test scenario 1:

  - set the SFM weight for S3 to 40 (that was supposed to be the minimum
    weight for the important systems); the other systems have weight = 1
  - deactivate LPARs S3, S4, and CF2 at the same time

Result:
  - systems S1 and S2 partitioned S3 and S4 out of the sysplex and rebuilt
    ISGLOCK. Everything worked as expected (except operlog, etc.)


After that I tried to build a scenario in which the ISGLOCK rebuild would
not happen. I based it on the explanation in "Setting Up a Sysplex"
(SA22-7625-09, because of APAR OA05860).

Test scenario 2:

  - set the SFM weights for S2, S3, S4 to 9999 and for S1 to 1
  - deactivate LPARs S2, S3, S4, and CF2 at the same time

I expected system S1 to be unable to start a rebuild, based on the
explanation and formula in the manual.

Results:
  - system S1 partitioned S2 - S4 out of the sysplex and _then_ rebuilt
    ISGLOCK. Everything looked normal (as in the results of scenario 1)
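
To make the arithmetic behind both scenarios concrete, here is a small
Python sketch. It assumes the formula in the manual reads roughly as: A =
sum of the SFM weights of the systems that lost connectivity to the
structure, B = sum of the SFM weights of all systems counted as connected to
it, and a rebuild is started when (A / B) * 100 >= REBUILDPERCENT. The
function names are made up for illustration, and treating the failed systems
as "still in B but not in A" before partitioning is only my reading, not
something the manual or the test confirms.

```python
def percent_lost(weights, lost, connected):
    """(A / B) * 100, with A and B as assumed in the text above."""
    a = sum(weights[s] for s in lost)       # weight that lost connectivity
    b = sum(weights[s] for s in connected)  # weight counted as connected
    return 100.0 * a / b


def would_rebuild(weights, lost, connected, rebuild_percent=1):
    """True if the assumed formula reaches REBUILDPERCENT (default 1)."""
    return percent_lost(weights, lost, connected) >= rebuild_percent


ALL = ["S1", "S2", "S3", "S4"]

# Scenario 1: S3 = 40, others = 1; S1 and S2 survive and lose CF2.
w1 = {"S1": 1, "S2": 1, "S3": 40, "S4": 1}
s1_before = would_rebuild(w1, ["S1", "S2"], ALL)          # 2/43  -> ~4.65%
s1_after = would_rebuild(w1, ["S1", "S2"], ["S1", "S2"])  # 2/2   -> 100%

# Scenario 2: S2-S4 = 9999, S1 = 1; only S1 survives and loses CF2.
w2 = {"S1": 1, "S2": 9999, "S3": 9999, "S4": 9999}
s2_before = would_rebuild(w2, ["S1"], ALL)     # 1/29998 -> ~0.0033%
s2_after = would_rebuild(w2, ["S1"], ["S1"])   # 1/1     -> 100%

print(s1_before, s1_after, s2_before, s2_after)
```

Under these assumptions, scenario 1 reaches the threshold whether or not the
failed systems are still counted, while scenario 2 reaches it only if the
formula is evaluated after S2 - S4 have been partitioned out. That would
match the order observed in the test, but it is an illustration, not a
confirmed explanation.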


Now I have some questions:

1. Does a system / connector rebuild a structure that has the default
rebuild percent after a loss of connectivity, no matter what SFM weights
are defined?

2. To rephrase question 1:
   Is the default rebuild percent the same as REBUILDPERCENT(1)?

3. What happens if I lose one or more systems and the primary couple data
sets at the same time? Is partitioning of the lost system(s) still possible,
or does the sysplex hang because the failed systems cannot acknowledge the
CDS switch to the alternates? That is, in what order is it done:
   first the partitioning of the failed system(s) and then the CDS switch,
or the other way around? (Such a test is not easy to organize.)

I hope I have not made this mail too complicated.

Zaromil
