Hi,
This thread was very timely. I have an SFM policy, and it has worked
fine since I set it up a couple of years ago.
Last Thursday we had a CEC outage and lost 6 of the 8 systems in this
Parallel Sysplex at the same time, along with an ICF on the same CEC as
those 6 systems. Yeah, that hurt :-( Another CEC, including an ICF, and
an external CF survived. We are still working with IBM to address all
aspects of the CEC outage, but we think its impact would have been
reduced or avoided if we had had a recent HIPER MCL 082 (J99673 stream)
installed on this 2094. The SFM policy did not partition the dead
systems out of the Sysplex without operator intervention. The remaining
two systems kept running but hung until operators manually replied.
I looked for IXC256A and did not find that it was issued, but I am
tracking APAR OA14593 (MSGIXC256A NOT RESPONDED TO BECAUSE IT IS NOT
READILY AVAILABLE). This was just an interesting APAR I turned up in
IBMLink; it seems unrelated to this situation.
The recovery hung until operators replied to IXC102A.
The systems failed at 17:31:
17:31:57.17 00000090 *IXC427A SYSTEM BTST HAS NOT UPDATED STATUS SINCE 17:31:05 679
679 00000090 BUT IS SENDING XCF SIGNALS. XCF SYSPLEX FAILURE MANAGEMENT WILL
679 00000090 REMOVE SYSTEM BTST IF NO SIGNALS ARE RECEIVED WITHIN A 45
679 00000090 SECOND INTERVAL.
17:31:57.17 00000090 *466 IXC426D SYSTEM BTST IS SENDING XCF SIGNALS BUT NOT UPDATING
                     STATUS. REPLY SYSNAME=BTST TO REMOVE THE SYSTEM.
17:31:57.54 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BSYS
17:31:57.54 STC32489 00000090 PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
17:31:58.71 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
17:31:58.71 STC32489 00000090 PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0C
17:31:59.63 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100PT01
17:31:59.63 STC32489 00000090 PXM4705 XMANAGER PLEXGRPX NEW=03 OLD=03 TYPE=0E
17:31:59.79 00000094 IEE400I THESE MESSAGES CANCELLED - 466.
17:32:00.08 STC32489 00000090 PXM4704 XMANAGER PLEXGRPX GRP=PXM1100 MEM=XMAN1100BTST
17:32:33.12 00000090 *467 IXC102A XCF IS WAITING FOR SYSTEM PT02 DEACTIVATION. REPLY DOW
                     WHEN MVS ON PT02 HAS BEEN SYSTEM RESET
The WTORs remained outstanding:
486 R 17.32.39 ASYS *486 IXC102A XCF IS WAITING
FOR SYSTEM BEND DEACTIVATION.
REPLY DOWN WHEN MVS ON BEND
HAS BEEN SYSTEM RESET
485 R 17.32.33 ASYS *485 IXC102A XCF IS WAITING
FOR SYSTEM BTST DEACTIVATION.
REPLY DOWN WHEN MVS ON BTST
HAS BEEN SYSTEM RESET
484 R 17.32.33 ASYS *484 IXC102A XCF IS WAITING
FOR SYSTEM PT01 DEACTIVATION.
REPLY DOWN WHEN MVS ON PT01
HAS BEEN SYSTEM RESET
482 R 17.32.30 ASYS *482 IXC102A XCF IS WAITING
FOR SYSTEM HSYS DEACTIVATION.
REPLY DOWN WHEN MVS ON HSYS
HAS BEEN SYSTEM RESET
483 R 17.32.30 ASYS *483 IXC102A XCF IS WAITING
FOR SYSTEM BSYS DEACTIVATION.
REPLY DOWN WHEN MVS ON BSYS
HAS BEEN SYSTEM RESET
17:39:19.44 CSYS0050 00000290 R 467,DOWN
17:43:04.60 CSYS0050 00000290 R 486,DOWN
Etc.
17:39:22.30 00000090 IXC105I SYSPLEX PARTITIONING HAS COMPLETED FOR PT02 324
324 00000090 - PRIMARY REASON: SYSTEM REMOVED BY SYSPLEX FAILURE MANAGEMENT BECAUSE
324 00000090 ITS STATUS UPDATE WAS MISSING
324 00000090 - REASON FLAGS: 000100
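To put numbers on how long the outstanding WTORs held things up, here is
a small Python sketch that just does the arithmetic on timestamps copied
from the log above. The event labels and the function names are mine,
and I am assuming all timestamps are from the same day. The issue time
for WTOR 486 comes from the outstanding-reply display, which shows it
only to the second (17.32.39).

```python
from datetime import datetime

# Timestamps copied from the SYSLOG excerpt (same calendar day assumed).
IXC102A_PT02_ISSUED = "17:32:33.12"   # *467 IXC102A for PT02
REPLY_467_DOWN = "17:39:19.44"        # R 467,DOWN
IXC105I_PT02_DONE = "17:39:22.30"     # partitioning completed for PT02
IXC102A_BEND_ISSUED = "17:32:39.00"   # *486, display shows 17.32.39
REPLY_486_DOWN = "17:43:04.60"        # R 486,DOWN


def parse(ts):
    """Parse an HH:MM:SS.ff SYSLOG timestamp (date part is ignored)."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def elapsed_seconds(start, end):
    """Seconds between two same-day timestamps."""
    return (parse(end) - parse(start)).total_seconds()


if __name__ == "__main__":
    intervals = [
        ("PT02: IXC102A issued -> operator reply",
         IXC102A_PT02_ISSUED, REPLY_467_DOWN),
        ("PT02: operator reply -> IXC105I",
         REPLY_467_DOWN, IXC105I_PT02_DONE),
        ("BEND: IXC102A issued -> operator reply",
         IXC102A_BEND_ISSUED, REPLY_486_DOWN),
    ]
    for label, start, end in intervals:
        print(f"{label}: {elapsed_seconds(start, end):.2f} s")
```

For PT02 that works out to a wait of almost 7 minutes for the reply and
under 3 seconds from the reply to IXC105I, so nearly all of the delay
was the WTOR sitting unanswered.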
We specify ISOLATETIME in our SFM policy. I have been reading the
Setting Up a Sysplex manual and IBMLink but still don't see exactly why
SFM was not able to isolate the failed systems and partition them out of
the Sysplex. We had full connectivity with XCF and 3 CFs for all systems
in the Sysplex. I expect there are some circumstances SFM cannot
handle, but this is exactly the kind of crash we want cleaned up
automatically so the remaining systems can process work with minimal
interruption.
DATA TYPE(SFM) REPORT(YES)
DEFINE POLICY NAME(POLICYS1) CONNFAIL(YES) REPLACE(YES)
  SYSTEM NAME(*)
    ISOLATETIME(0)
    WEIGHT(1)
  SYSTEM NAME(ASYS)
    ISOLATETIME(15)
    WEIGHT(100)
  SYSTEM NAME(CSYS)
    ISOLATETIME(15)
    WEIGHT(75)
  SYSTEM NAME(BSYS)
    WEIGHT(5)
  SYSTEM NAME(BEND)
    WEIGHT(5)
  SYSTEM NAME(PT01)
    WEIGHT(5)
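As a cross-check of the policy against the failure, the following Python
sketch lists which SYSTEM statements do not code ISOLATETIME explicitly
and which of the failed systems are not named at all and so are covered
only by NAME(*). This is a simplistic text scan, not how IXCMIAPU parses
its input. The list of failed systems is my reading of the IXC102A
messages above, and the function name is made up for illustration. How
an omitted keyword on a named system is defaulted is something to
confirm in Setting Up a Sysplex; the script only reports what is coded.

```python
import re

# Policy statements as posted above.
POLICY = """\
DATA TYPE(SFM) REPORT(YES)
DEFINE POLICY NAME(POLICYS1) CONNFAIL(YES) REPLACE(YES)
SYSTEM NAME(*)
ISOLATETIME(0)
WEIGHT(1)
SYSTEM NAME(ASYS)
ISOLATETIME(15)
WEIGHT(100)
SYSTEM NAME(CSYS)
ISOLATETIME(15)
WEIGHT(75)
SYSTEM NAME(BSYS)
WEIGHT(5)
SYSTEM NAME(BEND)
WEIGHT(5)
SYSTEM NAME(PT01)
WEIGHT(5)
"""

# Systems that had an IXC102A outstanding (assumption: taken from the log).
FAILED_SYSTEMS = ["PT02", "BEND", "BTST", "PT01", "HSYS", "BSYS"]


def system_entries(policy_text):
    """Return {system name: {keyword: value}} for each SYSTEM statement."""
    entries = {}
    current = None
    for raw in policy_text.splitlines():
        line = raw.strip()
        match = re.match(r"SYSTEM\s+NAME\(([^)]+)\)", line)
        if match:
            current = match.group(1)
            entries[current] = {}
            continue
        if current is not None:
            # Collect KEYWORD(value) pairs that follow a SYSTEM statement.
            for key, value in re.findall(r"(\w+)\((\w+)\)", line):
                entries[current][key] = value
    return entries


entries = system_entries(POLICY)
no_explicit_isolatetime = [n for n, kw in entries.items()
                           if "ISOLATETIME" not in kw]
only_covered_by_default = [s for s in FAILED_SYSTEMS if s not in entries]

print("No explicit ISOLATETIME:", no_explicit_isolatetime)
print("Failed systems covered only by NAME(*):", only_covered_by_default)
```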
I will probably assemble all the documentation and open an ETR, but so
far I don't see anything wrong with the SFM policy. Operators just
don't/can't sort all this out and respond fast enough when there is a
multi-system failure. If we take an unexpected failure and SFM doesn't
handle it without operator intervention, it hurts.
Any ideas? Anything you have done in this area to help speed resolution
of multi-system outages? Is an outage this wide something that SFM
should be able to handle?
Best Regards,
Sam Knutson, GEICO
Performance and Availability Management
mailto:[EMAIL PROTECTED]
(office) 301.986.3574
Quantized Revision of Murphy's Law: Everything goes wrong all at once.
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [EMAIL PROTECTED] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html