Mike has a lot of good ideas here, but I'll comment on a couple of minor points.
1. We have multiple VM systems in our shop, so changing the node name during a
disaster would cause more problems than it would solve. We have too much code
that does things like "If node = 'VMSYS1' Then Do ..." to be able to tolerate
new node names for DR. We update the logo file to change "RUNNING" in the
lower right corner of the screen to "DR SYS". The node name stays the same.
2. The other thing I would never change during DR is the time zone. Your users
have scheduled jobs based on the normal time zone of the system. If you change
the time zone, those jobs will run early or late. If the jobs run late, users
will be late getting their reports, data feeds, or whatever. If the jobs run
early, they may run before feeds from other systems have arrived. Either one
is not good.
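To put a number on the early/late problem, here is a small Python sketch (an illustration only, not code from either shop). It assumes a job scheduled for 06:00 wall-clock time, a system that normally runs on Central standard time, and a DR configuration that switches it to Eastern standard time. The fixed offsets, the date, and the function name are assumptions; daylight saving is ignored.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=-6), "CST")  # normal production zone (assumed)
EST = timezone(timedelta(hours=-5), "EST")  # zone after a hypothetical DR change

def early_by(hour, minute, normal, dr):
    """How much earlier (positive) or later (negative) a wall-clock job
    fires when the system clock zone changes from `normal` to `dr`."""
    day = datetime(2010, 2, 2)
    normal_run = day.replace(hour=hour, minute=minute, tzinfo=normal)
    dr_run = day.replace(hour=hour, minute=minute, tzinfo=dr)
    return normal_run - dr_run

shift = early_by(6, 0, CST, EST)
print(shift)  # a positive value means the job fires that much early
```

With these offsets the 06:00 job fires an hour earlier in real time, which is exactly the window in which a feed from another system may not have arrived yet.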
Dennis
"See, I'm a man of simple tastes. I like dynamite, and gunpowder... And
gasoline! Do you know what all of these things have in common? They're cheap!"
-- Heath Ledger as The Joker, in The Dark Knight
-----Original Message-----
From: The IBM z/VM Operating System [mailto:[email protected]] On Behalf
Of Mike Walter
Sent: Tuesday, February 02, 2010 15:08
To: [email protected]
Subject: Re: [IBMVM] First time DR excercise
Is your management 100% certain that YOU will survive the unplanned
disaster? Do you agree with them? What happens when there is a natural
disaster where you do survive, but your family needs you immediately? Do
you choose to go to the D.R. site for an unspecified period of time,
hoping that someone else will take care of your family's needs (some could
involve hospitalization, right?).
In that vein, see below for an alternative to which I am naturally partial
since it allows easy, well-tested, automated changes when running on a
different CPU. That's especially handy when an experienced systems
programmer is not available. I have always tried to write my VM D.R. plan
so that a typical VM operator, or even a non-VM manager (!) could bring up
the VM system at least to the point that program product cpuid keys need
to be updated for the recovery CPU.
First, if you plan to restore your production SPOOL data at your D.R.
site, placement (slot numbers) of the SPOOL volumes in the SYSTEM CONFIG's
"OWN"ed list does not matter *if* you are formatting them and restoring
from an SPXTAPE backup (or a product that calls SPXTAPE, such as
VM:Spool). However, I share with you that...
:religion on
1) it is always a "Best Practice" to place your SPOOL volumes in the
SYSTEM CONFIG either starting in SLOT 1,
2) with plenty of RESERVED slots below for adding more (RESERVED slots
have very little overhead, IBM won't even charge you more for them!) e.g.
PRODVM: RECOVERY: CP_Owned Slot 1 VMSP01 OWN
PRODVM: RECOVERY: CP_Owned Slot 2 VMSP02 OWN
PRODVM: RECOVERY: CP_Owned Slot 3 VMSP03 OWN
CP_Owned Slot 4 RESERVED
... and more RESERVED slots for future SPOOL growth as needed...
3) or in the very last slots (i.e. ending with SLOT 255 backing up from
there with plenty of RESERVED slots above),
... and more RESERVED slots above for future SPOOL growth as needed...
CP_Owned Slot 252 RESERVED
PRODVM: RECOVERY: CP_Owned Slot 253 VMSP03 OWN
PRODVM: RECOVERY: CP_Owned Slot 254 VMSP02 OWN
PRODVM: RECOVERY: CP_Owned Slot 255 VMSP01 OWN
4) Regardless of their top or bottom location, include "flashing
red-neon" comments (that's just a point of emphasis, don't search IBM doc
for "flashing red-neon" comments), warning you and your sysprog heirs to
NEVER CHANGE SPOOL VOLUME SLOT NUMBERS WITHOUT A CURRENT SPXTAPE BACKUP
(and a thorough plan already tested on a 2nd level system, and ... an
off-site copy of your resume)!
:religion off
Now, about that SYSTEM CONFIG file.
-----------------------------------
Let's presume that your normal production system runs on an ancient, creaky
old z800, with serial number 12345.
And that your employer has a number of shiny newer boxes with known serial
numbers, upon any of which you might be able to RECOVER your production
z/VM system should your creaky rusted decrepit old z800 crash and take a
while to repair -- as parts are shipped from some far-off location where
old machines are stored (maybe in the desert near all those old
aircraft?).
If none of those "RECOVERY" systems are available (i.e. it really was a
*true* DISASTER), then you'll come up on a disaster recovery provider's
"DISASTER" machine.
Consider including in the SYSTEM CONFIG something along the lines of:
---<snip>---
/* ----------------------------------------------------------- */
/* Standard operating environment. See also "NODAL CONFIG Y" */
/* which contains some hard-coded device addresses that change */
/* when running on different systems based upon the defined */
/* "System_Identifier" from this file. */
/* Warning!: The last (of duplicate) System_ID record with */
/* matching model and cpuid value overrides previous System_IDs*/
/* ----------------------------------------------------------- */
System_ID 2066 %%2345 PRODVM
System_ID 2094 %%nnn1 RECOVERY
System_ID 2094 %%nnn2 RECOVERY
System_ID 2097 %%nnn3 RECOVERY
System_ID 2097 %%nnn4 RECOVERY
/* Use DISASTER only when we're not in our data center so */
/* that extensive D.R. automation can take place on various */
/* service machines when running PROFILE EXEC/PROFILE GCS's. */
System_Identifier_Default DISASTER
/**********************************************************************/
/* Timezone Definitions */
/* (Next assumes Central time for prod, Eastern time for DISASTER) */
/**********************************************************************/
PRODVM: TESTVM: Timezone_Definition CST West 06.00.00
PRODVM: TESTVM: Timezone_Definition CDT West 05.00.00
RECOVERY: Timezone_Definition CST West 06.00.00
RECOVERY: Timezone_Definition CDT West 05.00.00
DISASTER: Timezone_Definition EST West 05.00.00
DISASTER: Timezone_Definition EDT West 04.00.00
---<snip>---
What's that you say? "What are those lines beginning with 'PRODVM:',
'RECOVERY:', 'DISASTER:', etc.?"
Well, those are the SYSTEM CONFIG's "RECORD QUALIFIERS" - oft
overlooked/VERY POWERFUL! Those names "PRODVM", "RECOVERY", "DISASTER"
(or anything you choose to use) were defined above when the CPUID matched
one listed in the "System_ID" records. I place the "System_ID" records
near the top of our SYSTEM CONFIG, so that they are available for use
everywhere below that point.
Any record that begins with a "RECORD QUALIFIER" is processed only when
that qualifier matches the system identifier selected at IPL: that is, on
a CPU whose model and serial number matched the corresponding "System_ID"
record, or, for the "System_Identifier_Default" name, on any CPU that
matched none of them.
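The selection logic described above can be sketched in a few lines of Python. This is a simplified model for illustration, not how CP actually parses the file: it assumes "%" stands for any single character, that the last matching System_ID record wins (as the comment in the SYSTEM CONFIG excerpt warns), and that a record may carry several qualifiers. The function names and the RECOVERY serial pattern are hypothetical.

```python
def matches(pattern, value):
    """True if value matches pattern, where '%' stands for any one character."""
    return len(pattern) == len(value) and all(
        p == "%" or p == v for p, v in zip(pattern, value)
    )

def resolve_system_id(records, default, model, serial):
    """Pick the system identifier; the last matching record wins."""
    chosen = default
    for rec_model, rec_serial, name in records:
        if matches(rec_model, model) and matches(rec_serial, serial):
            chosen = name
    return chosen

def effective_statements(lines, system_id):
    """Keep unqualified lines plus lines qualified for system_id,
    with the leading 'NAME:' qualifiers stripped."""
    kept = []
    for line in lines:
        words = line.split()
        qualifiers = []
        while words and words[0].endswith(":"):
            qualifiers.append(words.pop(0)[:-1].upper())
        if not qualifiers or system_id.upper() in qualifiers:
            kept.append(" ".join(words))
    return kept

# The second serial pattern is made up; the source shows only placeholders.
records = [("2066", "%%2345", "PRODVM"), ("2094", "%%0001", "RECOVERY")]
config = [
    "PRODVM: TESTVM: Timezone_Definition CST West 06.00.00",
    "RECOVERY: Timezone_Definition CST West 06.00.00",
    "DISASTER: Timezone_Definition EST West 05.00.00",
]

home = resolve_system_id(records, "DISASTER", "2066", "012345")
away = resolve_system_id(records, "DISASTER", "2817", "0ABCDE")
print(home, effective_statements(config, home))
print(away, effective_statements(config, away))
```

The point of the model is that nothing has to be edited at the D.R. site: an unlisted CPU simply falls through to the default name, and only the records qualified for that name take effect.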
So, to automate start-up (maybe you can sleep through a test!?), in
"PROFILE EXEC" on OPERATOR, include some code along the lines of:
---<snip>---
parse value diag(08,'QUERY USERID') with ,
self . ConfigSysID . '15'x .
?prodvm=(ConfigSysID)='PRODVM'
?testvm=(ConfigSysID)='TESTVM'
... more local code...
If \?prodvm & \?testvm then 'EXEC OPDISAST'
... more local code...
---<snip>---
The "OPDISAST EXEC" (code yours however you see fit, it's just a "Simple
Matter of Programming") should prompt the Operator along the lines of:
---<snip>---
/* Prolog; See Epilog for additional information ********************
* Exec Name - OPDISAST EXEC for OPERATOR 191 disk. *
* Unit Support - IS *
* Status - Version 1, Release 2.1 *
********************************************************************/
address 'COMMAND'
parse source xos xct xfn xft xfm xcmd xenvir .
parse upper arg parms 0 operands '(' options ')' parmrest
'CP SPOOL CONSOLE * START'
'CP SET RUN ON'
?test=wordpos('TEST',parms)>0
hi='1DE8'x /* 3270 hilight attrs */
lo='1D60'x /* 3270 std attributes */
...more code as needed...
'STATE CLRSCRN MODULE *' /* Preferred */
If rc=0 then clear='CLRSCRN MORE'
else clear='VMFCLEAR' /* Fall-back */
doc=hi'To HOLD the screen press "ENTER",'lo||,
'or to proceed press "CLEAR"'
cleardoc=left(doc,78-length(clear))||'('clear')'
Call Clear /* Just so it shows in a console listing */
say hi
say '>>>>>>>>> WARNING <<<<<<<>>>>>>> WARNING <<<<<<<>>>>>>> WARNING' ,
'<<<<<<<<<'
say ''
say center('The response to a "QUERY CPUID" does not match that',78)
say center('normally returned at "home".',78)
say lo
parse value diag(08,'CP QUERY VIRTUAL CONSOLE') with ,
. . . consrdev .
Do forever
say hi
say center('Is this system running on a Disaster Recovery',
'processor?',79)
say lo
say center('Reply:'hi'1, 2, or 3'lo,79)
say
say 'Replying:'hi||1||lo,
'will take some automated actions to bring'
say ' the PRODVM system up as if it were running'
say ' on its normal machine located in the PROD'
say ' DATA CENTER.'
say
say 'Replying:'hi||2||lo,
'will initiate standard production' ,
'procedures, likely'
say ' causing vendor programs to fail, and multitudes'
say ' of other'hi'"BAD"'lo'things to happen!'
say ' If the VMPROD processor has changed, contact' ,
'the z/VM sysprogs to update "SYSTEM CONFIG".'
say
say ' DO NOT simply reply "2" unless the zVM' ,
'sysprogs have instructed you to do so!'
say
say 'Replying:'hi||3||lo,
'will initiate automated D.R. procedures'
say ' which will write two files to the A-disk'
say ' preventing AUTOLOG2 from autologging IDs;'
say ' a tape will be requested from which'
say ' SDFs (System Data Files) will be restored'
say ' to spool.'
say ' This is used only at a D.R. vendor site'
say ' AWAY from home.'
say
say 'Replying:'hi||'QUIT'||lo
say ' will issue a "CP SYSTEM RESET" on this userid'
say ' giving the VM Systems Programming staff a chance'
say ' to change files as needed. This is recommended'
say ' ONLY for the VM Systems Programming staff!'
parse upper pull response 2 .
If abbrev('QUIT',response,1) then
Do
say center(hi'Response accepted:' response,79)
say center(' Issuing CP SYSTEM RESET to permit' ,
'file updates.',79)
say lo
'CP SLEEP 2 SEC'
'CP SYSTEM RESET'
Exit 'How did we *ever* get **here**!!??'
End
If response=1 then
Do
say center(hi'Response accepted:' response,79)
say time() 'Making changes for Business Resumption at' ,
'the PRODUCTION DATA CENTER.'
'CP SLEEP 2 SEC'
Call RecoveryAtHome /* For YOU to write, a S.M.O.P. */
say
say time() 'VM:Operator should start in a few moments...'
'CP SLEEP 5 SEC'
'CP SPOOL CONSOLE CLOSE'
Call Exit 0
End
If response=2 then
Do
say center(hi'Response accepted:' response,79)
say center(' Bypassing Disaster Recovery conversions.'lo,79)
say center(' You *better* know what you are doing!'lo,79)
'CP SLEEP 2 SEC'
Call Exit 0
End
If response=3 then Leave
say 'Read the message and respond 1, 2, 3, or QUIT.'
End
---<snip>---
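One detail of the prompt loop is easy to miss: "parse upper pull response 2 ." keeps only the first character of the reply, folded to upper case. A rough Python model of that dispatch follows. The action names are hypothetical labels for the four branches, not names from the exec.

```python
def classify_reply(raw):
    """Map an operator reply to an action label, looking only at the
    first character (upper-cased), as the REXX parse template does."""
    first = raw.upper()[:1]
    actions = {
        "Q": "reset",              # QUIT: CP SYSTEM RESET
        "1": "recover_at_home",    # business resumption in the prod data center
        "2": "bypass",             # no D.R. conversions
        "3": "disaster_recovery",  # leave the loop, run D.R. automation
    }
    return actions.get(first, "reprompt")

for reply in ("quit", "Q", "12", "3", "yes", ""):
    print(repr(reply), "->", classify_reply(reply))
```

So "q", "Quit" and "QUIT" are all accepted, a mistyped "12" is taken as 1, and anything else (including an empty reply) goes around the loop again. If that leniency worries you, compare the whole reply instead of its first character.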
Obviously, there's more that can/should be done. The system is your
oyster.
The key is that the matching "System_ID" (which was selected at IPL from
the "SYSTEM CONFIG" file based on the CPU serial number) is displayed in
the bottom right-hand corner of a 3270 terminal display, and is also
returned from a 'CP QUERY USERID'. E.g.
---<snip>---
parse value diag(08,'QUERY USERID') with ,
self . ConfigSysID . '15'x .
---<snip>---
On the other hand, the CMS command "IDENTIFY" returns the value from the
"SYSTEM NETID S" file. e.g.
---<snip>---
identify
OPERATOR AT PRODVM VIA RSCS 02/02/10 16:10:11 CST TUESDAY
Ready;
---<snip>---
During our D.R. tests, our "PRODVM" (the name has been changed for these
examples) comes up on a different box in another of our data centers.
Our "NODAL CONFIG Y" mentioned above gives us a single place to enter real
device addresses based on serial number. No CP Directory changes
required. No hunting through exits of myriad products. It contains (in
part):
---<snip>---
* This file is read by various REXX and GCS programs for use in
* normal operations and in disaster recovery both in Lincolnshire and
* at a disaster recovery vendor.
*SysCfgID Svm_Name Dtyp Rdev Comment
* CPC4 LPAR2 as of 20090214
PRODVM VTAM CTCA 0D61 'CTC 0D61 to SYSE, Read1'
PRODVM VTAM CTCA 1D61 'CTC 1D61 to SYSE, Read2'
PRODVM VTAM CTCA 0D62 'CTC 0D62 to SYSE, Write1'
PRODVM VTAM CTCA 1D62 'CTC 1D62 to SYSE, Write2'
PRODVM TCPIP OSA 0140 0140
PRODVM TCPIP OSA 0141 0141
PRODVM TCPIP OSA 0142 0142
* CPC4 LPAR2 as of Feb 2008?
PRODVM VTAM CTCA 0D71 'CTC 0D71 to SYSF, Read1'
PRODVM VTAM CTCA 1D71 'CTC 1D71 to SYSF, Read2'
PRODVM VTAM CTCA 0D72 'CTC 0D72 to SYSF, Write1'
PRODVM VTAM CTCA 1D72 'CTC 1D72 to SYSF, Write2'
PRODVM TCPIP OSA 0140 0140
PRODVM TCPIP OSA 0141 0141
PRODVM TCPIP OSA 0142 0142
* CPC4 LPAR3
TESTVM VTAM CTCA 0C50 'CTC 0C50 to SYSE, Read1'
TESTVM VTAM CTCA 0C51 'CTC 0C51 to SYSE Write1'
* CPC5 LPAR3
RECOVERY VTAM CTCA 0D61 'CTC 0D61 to SYSE for SNA terminals, Read1'
RECOVERY VTAM CTCA 1D61 'CTC 1D61 to SYSE for SNA terminals, Read2'
RECOVERY VTAM CTCA 0D62 'CTC 0D62 to SYSE for SNA terminals, Write1'
RECOVERY VTAM CTCA 1D62 'CTC 1D62 to SYSE for SNA terminals, Write2'
RECOVERY TCPIP OSA 6051 0140
RECOVERY TCPIP OSA 6052 0141
RECOVERY TCPIP OSA 6053 0142
---<snip>---
When VTAM and TCPIP come up, local code in them reads the "NODAL CONFIG Y"
file, making appropriate adjustments.
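The local code itself is not shown here, but the lookup it has to do is simple. As a hedged sketch in Python (the real readers are REXX and GCS programs), assuming whitespace-separated columns in the order given by the header line, "*" in column 1 for comments, and quoted strings kept as one field:

```python
import shlex

# A few rows in the layout shown above (SysCfgID Svm_Name Dtyp Rdev Comment).
SAMPLE = """\
* CPC4 LPAR2 as of 20090214
PRODVM VTAM CTCA 0D61 'CTC 0D61 to SYSE, Read1'
PRODVM TCPIP OSA 0140 0140
* CPC5 LPAR3
RECOVERY VTAM CTCA 0D61 'CTC 0D61 to SYSE for SNA terminals, Read1'
RECOVERY TCPIP OSA 6051 0140
RECOVERY TCPIP OSA 6052 0141
"""

def devices_for(text, sys_id, svm_name):
    """Return (dtyp, rdev, last_column) rows for one system identifier
    and one service machine."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("*"):
            continue  # blank line or comment
        fields = shlex.split(line)  # keeps a quoted comment as one field
        if len(fields) < 4:
            continue  # not a complete device row
        cfg_id, svm, dtyp, rdev = fields[:4]
        if cfg_id == sys_id and svm == svm_name:
            rows.append((dtyp, rdev, " ".join(fields[4:])))
    return rows

print(devices_for(SAMPLE, "RECOVERY", "TCPIP"))
print(devices_for(SAMPLE, "PRODVM", "VTAM"))
```

A service machine would call something like this with its own userid and the ConfigSysID it got back from QUERY USERID, then attach or define whatever the rows say. Note that shlex would choke on an unbalanced apostrophe in a comment, so a production reader needs a more forgiving split.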
Some service machines have been updated to check the "ConfigSysID", making
appropriate changes to their start-up, or operating processes.
For example, we don't want VMSCHED kicking off during D.R., trying to do
things that the users and we would regret, so its 'PROFILE EXEC' issues:
---<snip>---
/* ------------------------------------------------------------- */
/* Especially at RECOVERY until we are ready for production */
/* servers and work - stop INITIATE from happening. */
/* ------------------------------------------------------------- */
parse value diag(08,'QUERY USERID') with . . ConfigSysId . '15'x .
?prodvm=(ConfigSysId='PRODVM')
If \?prodvm then
'PIPE (NAME DoNotInitiate)' ,
'| STRLITERAL /INITIATE OFF ALL/' ,
'| PAD 80' ,
'| > VMSCHED INITIATE A F 80'
---<snip>---
We've been asked during testing to not _write_ to the Virtual Tape Servers
so that time required to restore normal production is reduced. Changes
made to VM:Tape exits only permit READ mounts during D.R. tests
(identified by the "RECOVERY" System_ID).
Our z/VM system is generally up and running within 30 minutes of being
given the recovery machine upon which to IPL the 'PRODVM' system. Consider
also: we are using remotely dual-copied *and* remotely mirrored DASD -
which does not need to be restored. Our users have been very good about
testing, usually completing their test within an hour. From IPL, through
program product CPUID record updates, user testing acceptance, to SHUTDOWN
is often about 2 hours. The z/OS folks here, with a MUCH MORE complicated
sysplex environment including many, many CICS regions, DB2's and lots
more, are generally here for 14+ hours (through cutover back to prod).
I've crafted most of these auto-recovery solutions over my 25 years at
Hewitt Associates. "Continuous Quality Improvements" is more than just a
slogan - it lets me sleep through much of our D.R. test time. What used
to be D.R. tests offsite for 48+ hours of non-stop sysprog work without
any sleep has become a veritable cake walk (sometimes they do bring in
cake, along with better foodstuffs, for the tired, huddled masses!).
Being able to test on a different machine in our other data center, using
remote-copy DASD (no restores required), is a huge part of that. But
automating the recovery to be able to come up with minimal sysprog
intervention was also very significant.
Critically: the "ConfigSysID" automation using the "SYSTEM CONFIG"
System_IDs provides the company with the means to bring the system up even
if the disaster includes the VM sysprogs. Good news for them, perhaps not
so good for us. ;-)
Mike Walter
Hewitt Associates
The opinions expressed herein are mine alone, not my employer's.