Re: Sysplex timeout problems.

Barbara Nitz Fri, 10 Jul 2009 01:28:19 -0700

>We have INTERVAL(85) and OPNOTIFY(87) and CLEANUP(15) so my question
>is when the resources get frozen. 
I hope that someone will correct me if I get this wrong:


System removal timeline, assuming all defaults:
1. v xcf,sysname,offline and press enter on the reply that comes for the 
v xcf,off.
This is what kicks everything off. In an ideal world, CLEANUP starts to run, 
meaning XCF on the shutdown system notifies all members of all XCF groups in 
the sysplex that it is about to load a wait state. Every XCF group is now 
supposed to do 'cleanup' for the leaving member.

2. After CLEANUP expires, the shutdown system loads the non-restartable 
wait state 0A2. 
This kicks off INTERVAL and OPNOTIFY. Loading a wait state means that the 
other systems detect SSUM (system status update missing), so INTERVAL (by 
default twice SPINTIME plus 5s) starts to run. At INTERVAL+3s (the default 
OPNOTIFY), message IXC102A is issued on the surviving system.

3. Once the wait state is loaded, system reset can take place. There is no 
need to wait for IXC102A to be issued. The reply to IXC102A probably leads to 
XCF telling all XCF group members 'system reported gone' (and I have no 
clue which event is issued in which order - the events reported to all XCF 
group members are described in detail in Sysplex Services Guide under 'Events 
that Cause XCF to Schedule a Group User Routine' - have fun reading that!) 
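
To put numbers on the timeline above, here is a rough sketch in Python. It 
takes the description in this post literally and assumes that CLEANUP runs 
for its full length and that INTERVAL and OPNOTIFY both start counting when 
the wait state is loaded. The names CoupleTimers, default_interval, 
default_opnotify and removal_timeline are made up for illustration, they are 
not part of any IBM interface, and real XCF timing may well differ.

```python
from dataclasses import dataclass


@dataclass
class CoupleTimers:
    """COUPLExx timer values in seconds (hypothetical container)."""
    interval: int
    opnotify: int
    cleanup: int


def default_interval(spintime: int) -> int:
    # Default as described in the post: twice SPINTIME plus 5 seconds.
    return 2 * spintime + 5


def default_opnotify(interval: int) -> int:
    # Default as described in the post: INTERVAL plus 3 seconds.
    return interval + 3


def removal_timeline(t: CoupleTimers) -> list[tuple[int, str]]:
    """Seconds after the reply to V XCF,sysname,OFFLINE for each event.

    Assumption: CLEANUP runs to expiry, and INTERVAL and OPNOTIFY both
    start when the wait state is loaded.
    """
    wait_state = t.cleanup
    return [
        (0, "reply to V XCF,sysname,OFFLINE; CLEANUP starts"),
        (wait_state, "wait state 0A2 loaded; INTERVAL and OPNOTIFY start"),
        (wait_state + t.interval, "INTERVAL expires on the surviving systems"),
        (wait_state + t.opnotify, "IXC102A issued on a surviving system"),
    ]


# Values quoted at the top of this post: INTERVAL(85) OPNOTIFY(87) CLEANUP(15)
for seconds, event in removal_timeline(CoupleTimers(85, 87, 15)):
    print(f"t+{seconds:>3}s  {event}")
```

Under those assumptions the quoted settings put the wait state at 15s and 
IXC102A at 102s after the reply. None of this says anything about when TCPIP 
itself starts its own timers.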

So much for the system side of shutdown. What I have no clue about is how 
TCPIP handles XCF telling it that a member of the XCF group goes away. 
TCPIP on the other systems might already start reacting when the XCF group 
member on the system where the 'p TCPIP' command was issued for shutdown 
goes away. Some sort of XCF cleanup (like 'leave the group') must be done 
when TCPIP terminates.

I don't know if TCPIP's 'detection of timeout' starts at the point CLEANUP 
starts to run or if it starts with SSUM or in reaction to any of the other 
state changes. TCPIP could also have timers completely independent of any 
XCF traffic that cause the messages you see.

>We have never required operators to RESET a down system once a wait
>state is achieved after V XCF OFF. I can't deny the possibility of some
>comatose I/O operation miraculously coming to life after seconds or minutes
>of WAIT STATE. 

Skip, I don't think the danger of a missing reset is so much an errant I/O; 
it's rather a hardware reserve that didn't get cleared for one reason or 
another, preventing other systems from accessing that device. Once in a 
parallel sysplex, fencing is done via the CF without the need for explicit 
system reset (except for the last system in the sysplex). So mostly, the 
warnings are for basic sysplexes that cannot do automatic fencing. Here it is 
really important to get all reserves released. 

>I can however attest to the inherent risk of an operator--or a
>distracted sysprog--going to the trouble of unlocking an LPAR in order to
>RESET...the wrong image. 

BTDTGTS, too, in this installation. Not by me, fortunately (I shut down one 
system and then varied offline the one that I hadn't shut down - thankfully 
the sysprog sandplex)! Besides, in case there are IPL problems, I wouldn't 
want to leave IBM the slim chance of them telling me 'you shouldn't have done 
that, so we're not fixing anything'. 

Barbara

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
