Re: First time DR excercise

O'Brien, Dennis L Tue, 02 Feb 2010 18:59:02 -0800

Mike has a lot of good ideas here, but I'll comment on a couple of minor points.


1. We have multiple VM systems in our shop, so changing the node name during a 
disaster would cause more problems than it would solve.  We have too much code 
that does things like "If node = 'VMSYS1' Then Do ..." to be able to tolerate 
new node names for DR.  We update the logo file to change "RUNNING" in the 
lower right corner of the screen to "DR SYS".  The node name stays the same.

2. The other thing I would never change during DR is the time zone.  Your users 
have scheduled jobs based on the normal time zone of the system.  If you change 
the time zone, those jobs will run early or late.  If the jobs run late, users 
will be late getting their reports, data feeds, or whatever.  If the jobs run 
early, they may run before feeds from other systems have arrived.  Either one 
is not good.
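
To put numbers on that, here is a minimal sketch in Python (not CMS code, just arithmetic you can try anywhere).  The 02:00 job time and the Central/Eastern pairing are assumptions for illustration, and daylight saving is ignored by using fixed offsets.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for illustration (daylight saving ignored).
CST = timezone(timedelta(hours=-6), "CST")   # normal system time zone
EST = timezone(timedelta(hours=-5), "EST")   # hypothetical DR time zone


def drift_minutes(hour, minute, home_tz, dr_tz):
    """Minutes by which a job scheduled at hour:minute local time fires
    late (positive) or early (negative) when the system clock shows
    dr_tz instead of home_tz."""
    day = (2010, 2, 2)
    intended = datetime(*day, hour, minute, tzinfo=home_tz)
    actual = datetime(*day, hour, minute, tzinfo=dr_tz)
    return int((actual - intended).total_seconds() // 60)


print(drift_minutes(2, 0, CST, EST))   # -60: the job runs an hour early
print(drift_minutes(2, 0, EST, CST))   # 60: the job runs an hour late
```

An hour early is enough for a job to start before an upstream feed has landed, which is the failure described above.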
                                                                                
                                           Dennis

"See, I'm a man of simple tastes. I like dynamite, and gunpowder... And 
gasoline! Do you know what all of these things have in common? They're cheap!"  
-- Heath Ledger as The Joker, in The Dark Knight


-----Original Message-----
From: The IBM z/VM Operating System [mailto:[email protected]] On Behalf 
Of Mike Walter
Sent: Tuesday, February 02, 2010 15:08
To: [email protected]
Subject: Re: [IBMVM] First time DR excercise

Is your management 100% certain that YOU will survive the unplanned 
disaster?  Do you agree with them?  What happens when there is a natural 
disaster where you do survive, but your family needs you immediately?  Do 
you choose to go to the D.R. site for an unspecified period of time, 
hoping that someone else will take care of your family's needs (some could 
involve hospitalization, right?). 

In that vein, see below for an alternative to which I am naturally partial 
since it allows easy, well-tested, automated changes when running on a 
different CPU.  That's especially handy when an experienced systems 
programmer is not available.  I have always tried to write my VM D.R. plan 
so that a typical VM operator, or even a non-VM manager (!) could bring up 
the VM system at least to the point that program product cpuid keys need 
to be updated for the recovery CPU.

First, if you plan to restore your production SPOOL data at your D.R. 
site, placement (slot numbers) of the SPOOL volumes in the SYSTEM CONFIG's 
"OWN"ed list does not matter *if* you are formatting them and restoring 
from an SPXTAPE backup (or a product that calls SPXTAPE, such as 
VM:Spool).  However, I share with you that...
:religion on
1) it is always a "Best Practice" to place your SPOOL volumes in the 
SYSTEM CONFIG either starting in SLOT 1, 
2) with plenty of RESERVED slots below for adding more (RESERVED slots 
have very little overhead, IBM won't even charge you more for them!) e.g. 
PRODVM: RECOVERY: CP_Owned   Slot  1   VMSP01     OWN 
PRODVM: RECOVERY: CP_Owned   Slot  2   VMSP02     OWN 
PRODVM: RECOVERY: CP_Owned   Slot  3   VMSP03     OWN 
                  CP_Owned   Slot  4   RESERVED 
... and more RESERVED slots for future SPOOL growth as needed... 
3) or in the very last slots (i.e. ending with SLOT 255 backing up from 
there with plenty of RESERVED slots above),

... and more RESERVED slots above for future SPOOL growth as needed...  
                  CP_Owned   Slot  252             RESERVED 
PRODVM: RECOVERY: CP_Owned   Slot  253  VMSP03     OWN
PRODVM: RECOVERY: CP_Owned   Slot  254  VMSP02     OWN 
PRODVM: RECOVERY: CP_Owned   Slot  255  VMSP01     OWN 
4) Regardless of their top or bottom location, include "flashing 
red-neon" comments (that's just a point of emphasis, don't search IBM doc 
for "flashing red-neon" comments), warning you and your sysprog heirs to 
NEVER CHANGE SPOOL VOLUME SLOT NUMBERS WITHOUT A CURRENT SPXTAPE BACKUP 
(and a thorough plan already tested on a 2nd level system, and ... an 
off-site copy of your resume)!
:religion off

Now, about that SYSTEM CONFIG file. 
----------------------------------- 
Let's presume that your normal production system runs on an ancient, creaky 
old z800, with serial number 12345.

And that your employer has a number of shiny newer boxes with known serial 
numbers, upon any of which you might be able to RECOVER your production 
z/VM system should your creaky rusted decrepit old z800 crash and take a 
while to repair -- as parts are shipped from some far-off location where 
old machines are stored (maybe in the desert near all those old 
aircraft?).

If none of those "RECOVERY" systems are available (i.e. it really was a 
*true* DISASTER), then you'll come up on a disaster recovery provider's 
"DISASTER" machine.

Consider including in the SYSTEM CONFIG something along the lines of:

---<snip>---  
/* ----------------------------------------------------------- */ 
/* Standard operating environment.  See also "NODAL CONFIG Y"  */ 
/* which contains some hard-coded device addresses that change */ 
/* when running on different systems based upon the defined    */ 
/* "System_Identifier" from this file.                         */ 
/* Warning!: The last (of duplicate) System_ID record with     */ 
/* matching model and cpuid value overrides previous System_IDs*/ 
/* ----------------------------------------------------------- */ 
System_ID 2066 %%2345 PRODVM 
System_ID 2094 %%nnn1 RECOVERY 
System_ID 2094 %%nnn2 RECOVERY 
System_ID 2097 %%nnn3 RECOVERY 
System_ID 2097 %%nnn4 RECOVERY 
 
/* Use DISASTER only when we're not in our data center so     */ 
/* that extensive D.R. automation can take place on various   */ 
/* service machines when running PROFILE EXEC/PROFILE GCS's.  */ 
System_Identifier_Default     DISASTER 

/**********************************************************************/ 
/*                        Timezone Definitions                        */ 
/* (Next assumes Central time for prod, Eastern time for DISASTER)    */
/**********************************************************************/ 
 
   PRODVM: TESTVM: Timezone_Definition CST West 06.00.00 
   PRODVM: TESTVM: Timezone_Definition CDT West 05.00.00 
 
   RECOVERY: Timezone_Definition CST West 06.00.00 
   RECOVERY: Timezone_Definition CDT West 05.00.00 
 
   DISASTER: Timezone_Definition EST West 05.00.00 
   DISASTER: Timezone_Definition EDT West 04.00.00 
---<snip>--- 


What's that you say?  "What are those lines beginning with 'PRODVM:', 
'RECOVERY:', 'DISASTER:', etc.?"
Well, those are the SYSTEM CONFIG's "RECORD QUALIFIERS" - oft 
overlooked/VERY POWERFUL!  Those names "PRODVM", "RECOVERY", "DISASTER" 
(or anything you choose to use) were defined above when the CPUID matched 
one listed in the "System_ID" records.  I place the "System_ID" records 
near the top of our SYSTEM CONFIG, so that they are available for use 
everywhere below that point. 

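Any record that begins with one of the "RECORD QUALIFIERS" is processed 
only when the System_ID selected for the running CPU (by model and CPU 
serial number) matches that qualifier.  Records without a qualifier are 
processed everywhere. 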
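
For anyone who wants to see the selection logic spelled out, here is a rough model of it in Python (so it can be tried off the mainframe).  It is a sketch of the behaviour described in this note, not of CP's actual parser: it assumes "%" in a CPUID pattern matches any single character, it follows the "last matching record wins" warning from the comment block above, and the 2094 pattern and the CPUIDs passed in are made-up values.

```python
# Hypothetical System_ID records as (model, cpuid pattern, name).
SYSTEM_IDS = [
    ("2066", "%%2345", "PRODVM"),
    ("2094", "%%0001", "RECOVERY"),   # made-up serial for illustration
]
DEFAULT_ID = "DISASTER"               # System_Identifier_Default


def cpuid_matches(pattern, cpuid):
    """'%' in the pattern is assumed to match any single character."""
    return len(pattern) == len(cpuid) and all(
        p == "%" or p == c for p, c in zip(pattern, cpuid))


def select_system_id(system_ids, default, model, cpuid):
    """Return the name from the last record matching model and cpuid."""
    chosen = default
    for rec_model, pattern, name in system_ids:
        if rec_model == model and cpuid_matches(pattern, cpuid):
            chosen = name             # later matches override earlier ones
    return chosen


def active_records(lines, system_id):
    """Keep unqualified records plus those qualified with system_id."""
    kept = []
    for line in lines:
        words = line.split()
        qualifiers = []
        while words and words[0].endswith(":"):
            qualifiers.append(words.pop(0)[:-1])
        if not qualifiers or system_id in qualifiers:
            kept.append(" ".join(words))
    return kept


CONFIG = [
    "PRODVM: TESTVM: Timezone_Definition CST West 06.00.00",
    "RECOVERY: Timezone_Definition CST West 06.00.00",
    "DISASTER: Timezone_Definition EST West 05.00.00",
    "CP_Owned Slot 4 RESERVED",
]

sys_id = select_system_id(SYSTEM_IDS, DEFAULT_ID, "2097", "123456")
print(sys_id)                          # no record matches, so the default
print(active_records(CONFIG, sys_id))
```

An unknown box falls through to the default name, so only the "DISASTER:" records and the unqualified ones survive.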

So, to automate start-up (maybe you can sleep through a test!?), in 
"PROFILE EXEC" on OPERATOR, include some code along the lines of:
---<snip>---
parse value diag(08,'QUERY USERID') with , 
            self . ConfigSysID . '15'x . 
?prodvm=(ConfigSysID)='PRODVM' 
?testvm=(ConfigSysID)='TESTVM' 
... more local code...
If \?prodvm & \?testvm then 'EXEC OPDISAST'
... more local code...
---<snip>---

The "OPDISAST EXEC" (code yours however you see fit, it's just a "Simple 
Matter of Programming") should prompt the Operator along the lines of:
---<snip>---
/* Prolog; See Epilog for additional information ******************** 
 * Exec Name     - OPDISAST EXEC for OPERATOR 191 disk.             * 
 * Unit Support  - IS                                               * 
 * Status        - Version 1, Release 2.1                           * 
 ********************************************************************/
 
   address 'COMMAND' 
   parse source xos xct xfn xft xfm xcmd xenvir . 
   parse upper arg parms 0 operands '(' options ')' parmrest 
 
  'CP SPOOL CONSOLE * START' 
  'CP SET RUN ON' 
   ?test=wordpos('TEST',parms)>0 
 
   hi='1DE8'x                               /* 3270 hilight attrs   */
   lo='1D60'x                               /* 3270 std attributes  */
 
...more code as needed... 

   'STATE CLRSCRN MODULE *'                             /* Preferred */ 
   If rc=0 then clear='CLRSCRN MORE' 
           else clear='VMFCLEAR'                        /* Fall-back */ 
   doc=hi'To HOLD the screen press "ENTER",'lo||, 
         'or to proceed press "CLEAR"' 
   cleardoc=left(doc,78-length(clear))||'('clear')' 
   Call Clear              /* Just so it shows in a console listing */ 
   say hi 
   say '>>>>>>>>> WARNING <<<<<<<>>>>>>> WARNING <<<<<<<>>>>>>> WARNING' ,
       '<<<<<<<<<' 
   say '' 
   say center('The response to a "QUERY CPUID" does not match that',78) 
   say center('normally returned at "home".',78) 
   say lo 
   parse value diag(08,'CP QUERY VIRTUAL CONSOLE') with , 
               . . . consrdev . 
 
   Do forever 
      say hi 
      say center('Is this system running on a Disaster Recovery', 
                 'processor?',79) 
      say lo 
      say center('Reply:'hi'1, 2, or 3'lo,79) 
      say 
      say 'Replying:'hi||1||lo, 
                       'will take some automated actions to bring' 
      say '             the PRODVM system up as if it were running' 
      say '             on its normal machine located in the PROD' 
      say '             DATA CENTER.' 
      say 
      say 'Replying:'hi||2||lo, 
                       'will initiate standard production' , 
                       'procedures, likely' 
      say '             causing vendor programs to fail, and multitudes'
      say '             of other'hi'"BAD"'lo'things to happen!' 
      say '             If the PRODVM processor has changed, contact' , 
                       'the z/VM sysprogs to update "SYSTEM CONFIG".' 
      say 
      say '             DO NOT simply reply "2" unless the z/VM' , 
                       'sysprogs have instructed you to do so!' 
      say 
      say 'Replying:'hi||3||lo, 
                       'will initiate automated D.R. procedures' 
      say '             which will write two files to the A-disk' 
      say '             preventing AUTOLOG2 from autologging IDs;' 
      say '             a tape will be requested from which' 
      say '             SDFs (System Data Files) will be restored' 
      say '             to spool.' 
      say '             This is used only at a D.R. vendor site' 
      say '             AWAY from home.' 
      say 
      say 'Replying:'hi||'QUIT'||lo 
      say '             will issue a "CP SYSTEM RESET" on this userid' 
      say '             giving the VM Systems Programming staff a chance'
      say '             to change files as needed.  This is recommended'
      say '             ONLY for the VM Systems Programming staff!' 
      parse upper pull response 2 . 
      If abbrev('QUIT',response,1) then 
         Do 
           say center(hi'Response accepted:' response,79) 
           say center(' Issuing CP SYSTEM RESET to permit' , 
                      'file updates.',79) 
           say lo 
          'CP SLEEP 2 SEC' 
          'CP SYSTEM RESET' 
           Exit 'How did we *ever* get **here**!!??'  
         End 
      If response=1 then 
         Do 
           say center(hi'Response accepted:' response,79) 
           say time() 'Making changes for Business Resumption at' , 
                      'the PRODUCTION DATA CENTER.'  
          'CP SLEEP 2 SEC' 
 
           Call RecoveryAtHome   /* For YOU to write, a S.M.O.P. */ 
 
           say 
           say time() 'VM:Operator should start in a few moments...' 
          'CP SLEEP 5 SEC' 
          'CP SPOOL CONSOLE CLOSE' 
           Call Exit 0 
         End 
      If response=2 then 
         Do 
           say center(hi'Response accepted:' response,79) 
           say center(' Bypassing Disaster Recovery conversions.'lo,79) 
           say center(' You *better* know what you are doing!'lo,79)  
          'CP SLEEP 2 SEC' 
           Call Exit 0 
         End 
      If response=3 then Leave 
      say 'Read the message and respond 1, 2, 3, or QUIT.' 
   End
---<snip>---


Obviously, there's more that can/should be done.  The system is your 
oyster.

The key is that the matching "System_ID" (which was selected at IPL from 
the "SYSTEM CONFIG" file based on the CPU serial number) is displayed in 
the bottom right-hand corner of a 3270 terminal display, and is also 
returned from a 'CP QUERY USERID'.  E.g.
---<snip>---
parse value diag(08,'QUERY USERID') with , 
            self . ConfigSysID . '15'x . 
---<snip>---
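
If the PARSE template looks cryptic, this is roughly what it does, modelled in Python rather than REXX.  It assumes the response has the form "userid AT sysid", optionally followed by a '15'x line-end byte (stood in for here by "\x15"); the function names are made up for the example.

```python
def parse_query_userid(response):
    """Split a 'userid AT sysid' response into (userid, sysid)."""
    first_line = response.split("\x15", 1)[0]   # drop anything after '15'x
    words = first_line.split()
    if len(words) < 3 or words[1] != "AT":
        raise ValueError(f"unexpected response: {response!r}")
    return words[0], words[2]


def needs_dr_prompt(sys_id, home_ids=("PRODVM", "TESTVM")):
    """True when the system did not come up under one of its home names."""
    return sys_id not in home_ids


userid, sys_id = parse_query_userid("OPERATOR AT RECOVERY\x15")
print(userid, sys_id, needs_dr_prompt(sys_id))
```

The second function mirrors the "If \?prodvm & \?testvm" test in the OPERATOR profile shown earlier.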


On the other hand, the CMS command "IDENTIFY" returns the value from the 
"SYSTEM NETID S" file. e.g.
---<snip>---
identify 
OPERATOR AT PRODVM VIA RSCS     02/02/10 16:10:11 CST      TUESDAY 
Ready;
---<snip>---

During our D.R. tests, our "PRODVM" (the name has been changed for these 
examples) comes up on a different box in another of our data centers.

Our "NODAL CONFIG Y" mentioned above gives us a single place to enter real 
device addresses based on serial number.  No CP Directory changes 
required.  No hunting through exits of myriad products.  It contains (in 
part):
---<snip>---
* This file is read by various REXX and GCS programs for use in 
* normal operations and in disaster recovery both in Lincolnshire and 
* at a disaster recovery vendor. 
 
*SysCfgID Svm_Name Dtyp Rdev Comment 
 
* CPC4 LPAR2 as of 20090214 
 PRODVM   VTAM     CTCA 0D61 'CTC 0D61 to SYSE, Read1' 
 PRODVM   VTAM     CTCA 1D61 'CTC 1D61 to SYSE, Read2' 
 PRODVM   VTAM     CTCA 0D62 'CTC 0D62 to SYSE, Write1' 
 PRODVM   VTAM     CTCA 1D62 'CTC 1D62 to SYSE, Write2' 
 
 PRODVM   TCPIP    OSA  0140 0140 
 PRODVM   TCPIP    OSA  0141 0141 
 PRODVM   TCPIP    OSA  0142 0142 
 
* CPC4 LPAR2 as of Feb 2008? 
 PRODVM   VTAM     CTCA 0D71 'CTC 0D71 to SYSF, Read1' 
 PRODVM   VTAM     CTCA 1D71 'CTC 1D71 to SYSF, Read2' 
 PRODVM   VTAM     CTCA 0D72 'CTC 0D72 to SYSF, Write1' 
 PRODVM   VTAM     CTCA 1D72 'CTC 1D72 to SYSF, Write2' 
 
 PRODVM   TCPIP    OSA  0140 0140 
 PRODVM   TCPIP    OSA  0141 0141 
 PRODVM   TCPIP    OSA  0142 0142 
 
* CPC4 LPAR3 
 TESTVM   VTAM     CTCA 0C50 'CTC 0C50 to SYSE, Read1' 
 TESTVM   VTAM     CTCA 0C51 'CTC 0C51 to SYSE  Write1' 
 
* CPC5 LPAR3 
 RECOVERY VTAM     CTCA 0D61 'CTC 0D61 to SYSE for SNA terminals, Read1' 
 RECOVERY VTAM     CTCA 1D61 'CTC 1D61 to SYSE for SNA terminals, Read2' 
 RECOVERY VTAM     CTCA 0D62 'CTC 0D62 to SYSE for SNA terminals, Write1'
 RECOVERY VTAM     CTCA 1D62 'CTC 1D62 to SYSE for SNA terminals, Write2'
 
 RECOVERY TCPIP    OSA  6051 0140 
 RECOVERY TCPIP    OSA  6052 0141 
 RECOVERY TCPIP    OSA  6053 0142 
---<snip>---

When VTAM and TCPIP come up, local code in them reads the "NODAL CONFIG Y" 
file, making appropriate adjustments.
Some service machines have been updated to check the "ConfigSysID", making 
appropriate changes to their start-up, or operating processes. 
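
Our local code for that is REXX and GCS, and is not shown here.  As a rough idea of the lookup, here is a sketch in Python that reads records in the "NODAL CONFIG" layout above and picks the real device addresses for one System_ID and one service machine.  The function names are invented for the example, and treating the last column as free-form comment text is an assumption.

```python
SAMPLE = """\
*SysCfgID Svm_Name Dtyp Rdev Comment
 PRODVM   TCPIP    OSA  0140 0140
 PRODVM   TCPIP    OSA  0141 0141
 RECOVERY VTAM     CTCA 0D61 'CTC 0D61 to SYSE for SNA terminals, Read1'
 RECOVERY TCPIP    OSA  6051 0140
 RECOVERY TCPIP    OSA  6052 0141
"""


def parse_nodal_config(text):
    """Return one dict per data record; '*' lines are comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue
        sys_id, svm, dtype, rdev, *rest = line.split(None, 4)
        comment = rest[0].strip("'") if rest else ""
        entries.append({"sys_id": sys_id, "svm": svm, "dtype": dtype,
                        "rdev": rdev, "comment": comment})
    return entries


def devices_for(entries, sys_id, svm):
    """Real device addresses for one System_ID and service machine."""
    return [e["rdev"] for e in entries
            if e["sys_id"] == sys_id and e["svm"] == svm]


entries = parse_nodal_config(SAMPLE)
print(devices_for(entries, "RECOVERY", "TCPIP"))
print(devices_for(entries, "PRODVM", "TCPIP"))
```

The point of the single file is that the service machine only needs its own name and the ConfigSysID to find its devices.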

For example, we don't want VMSCHED kicking off during D.R., trying to do 
things that the users and we would regret, so its 'PROFILE EXEC' issues:
---<snip>---
/* ------------------------------------------------------------- */ 
/* Especially at RECOVERY until we are ready for production      */ 
/* servers and work - stop INITIATE from happening.              */ 
/* ------------------------------------------------------------- */ 
   parse value diag(08,'QUERY USERID') with . . ConfigSysId . '15'x .
   ?prodvm=(ConfigSysId='PRODVM') 
   If \?prodvm then 
     'PIPE (NAME DoNotInitiate)' , 
        '| STRLITERAL /INITIATE OFF ALL/' , 
        '| PAD 80' , 
        '| > VMSCHED INITIATE A F 80' 
---<snip>---

We've been asked during testing not to _write_ to the Virtual Tape Servers 
so that the time required to restore normal production is reduced.  Changes 
made to VM:Tape exits only permit READ mounts during D.R. tests 
(identified by the "RECOVERY" System_ID).

Our z/VM system is generally up and running within 30 minutes of being 
given the recovery machine upon which to IPL the 'PRODVM' system. Consider 
also: we are using remotely dual-copied *and* remotely mirrored DASD - 
which does not need to be restored.  Our users have been very good about 
testing, usually completing their test within an hour. From IPL, through 
program product CPUID record updates, user testing acceptance, to SHUTDOWN 
is often about 2 hours.  The z/OS folks here, with a MUCH MORE complicated 
sysplex environment including many, many CICS regions, DB2's and lots 
more, are generally here for 14+ hours (through cutover back to prod).

I've crafted most of these auto-recovery solutions over my 25 years at 
Hewitt Associates.  "Continuous Quality Improvements" is more than just a 
slogan - it lets me sleep through much of our D.R. test time.  What used 
to be D.R. tests offsite for 48+ hours of non-stop sysprog work without 
any sleep has become a veritable cake walk (sometimes they do bring in 
cake, along with better foodstuffs, for the tired, huddled masses!). 

Being able to test on a different machine in our other data center, using 
remote-copy DASD (no restores required), is a huge part of that.  But 
automating the recovery to be able to come up with minimal sysprog 
intervention was also very significant. 

Critically: the "ConfigSysID" automation using the "SYSTEM CONFIG" 
System_IDs provides the company with the means to bring the system up even 
if the disaster includes the VM sysprogs.  Good news for them, perhaps not 
so good for us.  ;-)

Mike Walter
Hewitt Associates
The opinions expressed herein are mine alone, not my employer's.