"SFM" is Sysplex Failure Management. You define an "SFM Policy" that
determines how you want z/OS to deal with different sorts of failure, for
example, unresponsive systems or loss of XCF signal connectivity between
systems. If your policy specifies the ISOLATE parameter, the surviving
systems will attempt to automatically isolate (aka fence) an unresponsive
system. If this can be done successfully, the operator will not be prompted.
If the isolation fails, the operator will be prompted.
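
For illustration only, a minimal SFM policy definition run through the
IXCMIAPU administrative data utility might look like the sketch below. The
policy name POLICY1 is made up, the job card is omitted, and the values are
placeholders rather than recommendations. Note that on the SYSTEM statement
the isolation keyword is spelled ISOLATETIME (seconds to wait before
isolating). Check the syntax against "Setting Up a Sysplex" for your release
before using any of it.

```jcl
//* Sketch only: POLICY1 and all values are illustrative assumptions.
//* Add your own JOB statement. With no DSN, the utility updates the
//* active SFM couple data set.
//DEFSFM   EXEC PGM=IXCMIAPU
//SYSPRINT DD   SYSOUT=*
//SYSIN    DD   *
  DATA TYPE(SFM)
  DEFINE POLICY NAME(POLICY1) CONNFAIL(NO) REPLACE(YES)
    SYSTEM NAME(*)
      ISOLATETIME(0)
/*
```

A policy defined this way does nothing until it is activated, for example
with SETXCF START,POLICY,TYPE=SFM,POLNAME=POLICY1.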

The issue with an untimely response to the operator prompt is that until the
surviving systems know that the unresponsive system is "DOWN", they expect
it to participate in the sysplex. The lack of a response can lead to
sympathy sickness on the surviving systems, which is probably the source of
the various timeouts that were observed.

A strong word of caution:

DO NOT automatically reply DOWN. It must truly be the case that the subject
system has been reset. Failure to ensure that the system is reset can lead
to data corruption. The DOWN reply causes XCF to notify the sysplex
applications that the subject system is removed from the sysplex, the
implication being that said system no longer has access to any shared
resources. Database managers may then, for example, release any locks held
by the dead system. However, if the system has not been reset, there may
still be ongoing I/O manipulating the shared resources (your database). If
so, you in effect have the potential for a rogue write operation that can
corrupt data.

Being in a wait-state is not the same thing as being reset. I/O can still be
ongoing for a system in a wait-state. The system really needs to be reset
(or re-IPL'd) to be completely safe.
