Hi Timothy,

You bring up some interesting points, but I'd like to look at it from a 
different POV.  Do most companies have a fully redundant hot site where they 
can move a single piece of equipment (whether it be a mainframe, a 
midrange/Unix system, or a Windows server) to the alternate site with the flip 
of a switch?  While I have never had to take part in a real disaster, I've 
participated in tabletop DR scenarios, and one of the first things discussed 
has *always* been "What is the cost (in dollars, as well as outage/down time 
and the people cost of the disruption) of moving to the DR site versus just 
biting the bullet and waiting for the primary site to come back up?"
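That tabletop question is really just a back-of-the-envelope comparison. As a 
minimal sketch (all figures below are illustrative assumptions, not from any 
real incident):

```python
# Hypothetical comparison of the two options that come up in every tabletop
# DR exercise: fail over to the alternate site, or wait out the primary-site
# outage. Every number here is an assumed, illustrative input.

def outage_cost(downtime_hours, hourly_business_impact, fixed_cost=0.0):
    """Total cost of an option: fixed effort plus downtime-driven loss."""
    return fixed_cost + downtime_hours * hourly_business_impact

# Assumed inputs:
impact_per_hour = 50_000.0   # business impact of being down, $/hour
failover_hours = 8.0         # time to declare, switch, and validate at DR
failover_fixed = 200_000.0   # declaration fees, staff disruption, later fallback
wait_hours = 48.0            # estimated time for the primary site to recover

cost_failover = outage_cost(failover_hours, impact_per_hour, failover_fixed)
cost_wait = outage_cost(wait_hours, impact_per_hour)

print(f"Fail over: ${cost_failover:,.0f}  Wait: ${cost_wait:,.0f}")
# -> Fail over: $600,000  Wait: $2,400,000
```

With these made-up numbers failing over wins, but shrink the expected wait and 
the answer flips, which is exactly why the question gets asked first.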
Now combine that with the fact that WV is apparently running severely outdated 
equipment (a self-inflicted wound) supported by a third-party vendor.  
Depending on how old the equipment actually is, would there even be a 
third-party DR site able to support their system?

It wasn't too many years ago that my site at the time had a DR contract with 
IBM BCRS.  We were running an older mainframe that was still supported by IBM, 
but when it came to doing a DR test, we couldn't go to the BCRS site our 
contract was with (and had a permanent network link to) because that 
particular BCRS site had no hardware old enough to support us.  We had to go to 
a different BCRS site, and IBM had to back-haul our network connection from 
their primary site to the one they put us in.

My guess is that WV backed themselves into a corner by not giving their IT 
staff the resources to keep their systems current, woefully underfunding the 
capital needed to keep them running, and now the "savior-du-jour" is coming in 
riding on a cloud and that's going to save the day for them.

Rex

-----Original Message-----
From: IBM Mainframe Discussion List <[email protected]> On Behalf Of 
Timothy Sipples
Sent: Wednesday, July 27, 2022 12:12 AM
To: [email protected]
Subject: [EXTERNAL] Re: Mainframe outage affecting W.Va. state agencies could 
take 48, 72 hours to resolve

I have absolutely no information about this incident other than what the media 
are reporting. I wish everyone involved the best success.

My *personal* curiosity revolves around the Disaster Recovery plan and 
resources. As I'm sure we all know, the standard/typical operational practice is 
to have an alternate site, separated at some distance, equipped with standby 
resources. Disk subsystems replicate between sites (primary to alternate) 
either synchronously or asynchronously. Or at least there'd be a remote tape 
library, preferably virtual to some degree (for performance reasons), 
preferably with multiple incremental backups per day. If the primary site is 
lost, for whatever reason(s), the IT operations team restores at least critical 
services from the alternate site. It might be a long RTO (24 hours for example) 
if it's a basic/entry DR arrangement, but it'd be something.
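The RPO side of that arrangement is simple arithmetic: with evenly spaced 
backups, the worst-case data loss is bounded by one backup interval. A minimal 
sketch (the numbers are illustrative assumptions):

```python
# A minimal sketch of the RPO arithmetic behind "multiple incremental backups
# per day": the most data you can lose is one backup interval's worth.
# For comparison, synchronous disk replication gives an RPO of effectively
# zero, and asynchronous replication typically seconds to minutes.

def worst_case_rpo_hours(backups_per_day):
    """Worst-case data-loss window, assuming evenly spaced backups."""
    return 24.0 / backups_per_day

print(worst_case_rpo_hours(1))   # one backup/day   -> 24.0 hours of exposure
print(worst_case_rpo_hours(4))   # four backups/day ->  6.0 hours of exposure
```

That 24-hour bound is why a single nightly tape run, by itself, is the weakest 
of the options listed above.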

Over many years I've only ever worked with two clients that had no real DR plan 
and essentially no DR resources when I first met them. As it happens they were 
both government agencies, but they were also both located in fairly poor or 
poorer developing countries. One client took frequent tape backups and shuttled 
physical tapes off-site so at least they'd be able to recover to some point, 
eventually. (RTO="a week or two," RPO=12+ hours probably.) I wasn't happy they 
had to operate that way, but their constraints were genuine. I worked with the 
other government agency to eliminate their exposure within a tight budget, and 
they now have an alternate site with a reasonable DR capability.

I also remember working with another customer in a developing country, a bank. 
They were upgrading their systems, and their original plan involved losing DR 
protections for a couple days (about 48 hours) as I recall. That plan troubled 
me, so I worked with them to create a better, safer plan that preserved DR 
coverage throughout the upgrade project. They chose the revised plan. They 
completed their upgrade project on-time, within budget, and without incident.

So what happened to the alternate site (and DR switchover to it)?

- - - - -
Timothy Sipples
Senior Architect
Digital Assets, Industry Solutions, and Cybersecurity
IBM zSystems/LinuxONE, Asia-Pacific
[email protected]


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions, send email to 
[email protected] with the message: INFO IBM-MAIN

