Hi, regarding the call to create a list of disaster recovery (DR) use cases ( 
http://lists.openstack.org/pipermail/openstack-dev/2014-March/028859.html ), 
the following list sketches some speculative OpenStack DR use cases. These use 
cases do not reflect any specific product behavior and span a wide spectrum. 
This list is not a proposal; it is intended primarily to solicit additional 
discussion. The first basic use case, (1), is described in a bit more detail 
than the others; many of them are elaborations on this basic theme. 



* (1) [Single VM]

A single Windows VM with 4 volumes and VSS (Microsoft's Volume Shadow Copy 
Service) installed runs a key application and its integral database. VSS can 
quiesce the app, database, filesystem, and I/O on demand and can be invoked 
from outside the guest.

   a. The VM's volumes, including the boot volume, are replicated to a remote 
DR site (another OpenStack deployment).

   b. Some form of replicated VM or VM metadata exists at the remote site. 
This VM or VM description includes the replicated volumes. Some systems might 
use cold migration or some form of wide-area live VM migration to establish 
this remote-site VM or description.

   c. When specified by an SLA or policy, VSS is invoked, putting the VM's 
volumes in an application-consistent state. This state is flushed all the way 
through to the remote volumes. As each remote volume reaches its 
application-consistent state, this is recognized in some fashion, perhaps by 
an in-band signal, and a snapshot of the volume is made at the remote site. 
Volume replication is re-enabled immediately following the snapshot. A backup 
is then made of the snapshot on the remote site. At the completion of this 
cycle, application-consistent volume snapshots and backups exist on the remote 
site. (A rough sketch of this cycle appears after this list.)

   d. When a disaster or firedrill happens, the replication network connection 
is cut. The remote-site VM, pre-created or defined to use the replicated 
volumes, is then booted using the latest application-consistent state of the 
replicated volumes. The entire VM environment (management accounts, networking, 
external firewalling, console access, etc.), similar to that of the primary, 
either needs to pre-exist in some fashion on the secondary or be created 
dynamically by the DR system. The booting VM either needs to attach to a 
virtual network environment similar to that at the primary site, or it needs 
boot code that can alter its network personality. Networking configuration may 
occur in conjunction with an update to DNS and other networking infrastructure. 
All required networking configuration must be pre-specified or done 
automatically; no manual admin activity should be required. Environment 
requirements may be stored in a DR configuration or database associated with 
the replication. 

   e. In a firedrill or test, the virtual network environment at the remote 
site may be a "test bubble" isolated from the real network, with some provision 
for protected access (such as NAT). Automatic testing is necessary to verify 
that replication succeeded. These tests need to be configurable by the end-user 
and admin and integrated with DR orchestration.

   f. After the VM has booted and been operational, the network connection 
between the two sites is re-established. A replication connection between the 
replicated volumes is re-established, and the replicated volumes are re-synced, 
with the roles of primary and secondary reversed. (Ongoing replication in this 
configuration may occur, driven from the new primary.)

   g. A planned failback of the VM to the old primary proceeds similarly to 
the failover from the old primary to the old replica, but with the roles 
reversed and with the process minimizing offline time and data loss.
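
To make the cycle in (c) a bit more concrete, here is a minimal Python sketch
of one SLA-driven replication cycle for a single VM. Every name in it
(quiesce_guest, mark_consistency_point, and so on) is a placeholder for
whatever guest-agent, storage-backend, or OpenStack API a real DR service
would use; nothing below is an existing interface.

    # Hypothetical sketch of one application-consistent replication cycle
    # per use case (1)(c).  All functions are placeholders, not real APIs.

    def quiesce_guest(vm_id):
        """Flush and freeze guest I/O (VSS on Windows, qemu-ga fsfreeze on
        Linux).  Placeholder."""
        raise NotImplementedError

    def unquiesce_guest(vm_id):
        """Thaw the guest so the application resumes.  Placeholder."""
        raise NotImplementedError

    def mark_consistency_point(local_volume_id):
        """Record a marker for the quiesced state, e.g. an in-band signal
        in the replication stream.  Placeholder."""
        raise NotImplementedError

    def wait_for_marker(remote_volume_id, marker):
        """Block until the remote replica has applied all writes up to the
        marker, i.e. is application-consistent.  Placeholder."""
        raise NotImplementedError

    def snapshot_remote_volume(remote_volume_id):
        """Snapshot the replica at the DR site and return the snapshot id.
        Placeholder."""
        raise NotImplementedError

    def resume_replication(remote_volume_id):
        """Re-enable replication for the volume pair.  Placeholder."""
        raise NotImplementedError

    def backup_remote_snapshot(snapshot_id):
        """Back up the remote snapshot at the DR site.  Placeholder."""
        raise NotImplementedError

    def replication_cycle(vm_id, volume_pairs):
        """One cycle: quiesce, mark, thaw quickly, then snapshot and back
        up each replica once it reaches the marked consistency point."""
        quiesce_guest(vm_id)
        try:
            markers = [(remote, mark_consistency_point(local))
                       for local, remote in volume_pairs]
        finally:
            unquiesce_guest(vm_id)      # keep the freeze window short
        for remote, marker in markers:
            wait_for_marker(remote, marker)
            snap = snapshot_remote_volume(remote)
            resume_replication(remote)  # per (c), resume right after snapshot
            backup_remote_snapshot(snap)

Thawing before the remote-site work is one reasonable choice, so the
application is frozen only for the local flush; the rest of the cycle operates
on the already-marked replication stream.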



* (2) [Core tenant/project infrastructure VMs] 

Twenty VMs power the core infrastructure of a group using a private cloud 
(OpenStack in their own datacenter). Not all VMs run Windows with VSS; some 
run Linux with an equivalent mechanism, such as qemu-ga driving fsfreeze and 
signal scripts. These VMs are replicated to a remote OpenStack deployment in a 
fashion similar to (1). Orchestration at the remote site on failover is more 
complex (correct VM boot order is orchestrated, DHCP service is configured as 
expected, all IPs are made available and verified). An equivalent virtual 
network topology consisting of multiple networks or subnets might be 
pre-created or dynamically created at failover time. 
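
To make the boot-order orchestration above concrete, the following is a sketch
of the kind of per-VM recovery-plan metadata and replay loop a DR service
might keep for these twenty VMs. The plan format, the VM names and addresses,
and the boot_vm()/wait_until_reachable() calls are all invented for the
example.

    # Hypothetical recovery plan for use case (2): boot order, fixed IPs and
    # networks are recorded per VM and replayed tier by tier at failover.

    RECOVERY_PLAN = [
        {"name": "dns-01",  "boot_priority": 0,
         "network": "core", "fixed_ip": "10.0.0.5"},
        {"name": "dhcp-01", "boot_priority": 0,
         "network": "core", "fixed_ip": "10.0.0.6"},
        {"name": "db-01",   "boot_priority": 1,
         "network": "core", "fixed_ip": "10.0.0.10"},
        {"name": "app-01",  "boot_priority": 2,
         "network": "apps", "fixed_ip": "10.0.1.20"},
        # ... one entry per VM, twenty in all
    ]

    def boot_vm(entry):
        """Boot one VM at the DR site from its replicated volumes.
        Placeholder for the actual Nova/Cinder/Neutron calls."""
        raise NotImplementedError

    def wait_until_reachable(entry):
        """Verify the VM answers on its expected IP before the next tier
        starts.  Placeholder."""
        raise NotImplementedError

    def fail_over(plan):
        """Boot VMs tier by tier, lowest boot_priority first, verifying
        each tier (DHCP up, IPs reachable) before moving on."""
        for priority in sorted({e["boot_priority"] for e in plan}):
            tier = [e for e in plan if e["boot_priority"] == priority]
            for entry in tier:
                boot_vm(entry)
            for entry in tier:
                wait_until_reachable(entry)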

   a. Storage for all volumes of all VMs might be on a single storage backend 
(logically a single large volume containing many smaller sub-volumes, examples 
being a VMware datastore or Hyper-V CSV). This entire large volume might be 
replicated between similar storage backends at the primary and secondary 
sites. A single replicated large volume thus replicates all of the tenant's VM 
volumes. The DR system must trigger quiescing of all volumes to an 
application-consistent state.

   b. This environment needs to deal with failures of the primary datacenter 
(as when a trenching tool cuts its connection to the internet), routine 
firedrill tests that perform failover and failback, and planned migration.

   c. VSS or fsfreeze may be expected to fail for some VMs; policies and SLAs 
need to contend with this and alert admins for manual follow-up. 

   d. Network bandwidth used for replication needs to be throttled so as not to 
overly disrupt the private cloud's gateway capacity. 

   e. DR replication needs to deal with intermittent network replication 
failure and recover gracefully. In the case of a known network issue, such as 
maintenance, it needs to be possible for the admin to explicitly suspend 
network replication. Replication I/O is then logged locally at the primary 
site in some fashion. The remote site needs to stay replication-ready, but 
failover does not occur. When the network issue is over, replication resumes, 
perhaps recovering via a log, a map of updated blocks, or an equivalent 
technique (a sketch of such a map follows this list). In this example the RPO 
window is deliberately ignored and allowed to grow until replication is 
resumed by the admin.

   f. This tenant requires encryption of network replication traffic.

   g. Cost accounting and chargeback are required.
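
For (e) above, this is a minimal sketch of the changed-block map the primary
might keep while the admin has suspended replication; resuming then means
shipping only the recorded extents. The 4 MiB granularity and the
read_extent/send_extent callables are assumptions, not an existing interface.

    # Hypothetical changed-block map kept while replication is suspended
    # (use case (2)(e)).  Only block indexes are recorded, not data, so the
    # map stays small; on resume, just the dirty extents are re-sent.

    BLOCK_SIZE = 4 * 1024 * 1024        # 4 MiB tracking granularity (assumed)

    class ChangedBlockMap(object):
        def __init__(self):
            self.dirty = set()          # block indexes written while suspended
            self.suspended = False

        def suspend(self):
            self.suspended = True

        def record_write(self, offset, length):
            """Called for every write while suspended; cheap bookkeeping only."""
            first = offset // BLOCK_SIZE
            last = (offset + length - 1) // BLOCK_SIZE
            self.dirty.update(range(first, last + 1))

        def resume(self, read_extent, send_extent):
            """Drain the map: re-read each dirty extent from the primary
            volume and send it to the replica, then replicate normally."""
            for block in sorted(self.dirty):
                data = read_extent(block * BLOCK_SIZE, BLOCK_SIZE)
                send_extent(block * BLOCK_SIZE, data)
            self.dirty.clear()
            self.suspended = False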



* (3) [Multi-tier app infrastructure] 

A tenant has a service consisting of 8 multi-tier apps, each consisting of 3 
to 5 VMs, with each VM having 2 to 4 disks. Replication snapshots need to be 
made of the volumes in an application-consistent way across all the volumes of 
all the VMs in all the multi-tier apps. Again, these volumes may exist on a 
single large volume or datastore, perhaps simplifying creation of the cross-VM 
application-consistency snapshot. Not all of the VMs in a multi-tier app may 
need to be quiesced; some may be stateless and simply need to be recovered to 
a running state.
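
One way to picture the cross-VM consistency point described above, as a
hedged sketch: freeze every stateful VM of one multi-tier app, take a single
snapshot of the shared backing volume, then thaw. freeze_vm(), thaw_vm(), and
snapshot_datastore() stand in for guest-agent and storage-backend calls that
are not assumed to exist today.

    # Hypothetical cross-VM application-consistent snapshot for one
    # multi-tier app in use case (3).  Stateless VMs are skipped; they are
    # simply re-launched on recovery.

    def freeze_vm(vm_id):
        raise NotImplementedError       # VSS / qemu-ga fsfreeze, per VM

    def thaw_vm(vm_id):
        raise NotImplementedError

    def snapshot_datastore(datastore_id):
        raise NotImplementedError       # one backend snapshot covers all sub-volumes

    def app_consistent_snapshot(app):
        """app is a dict like {"stateful_vms": [...], "stateless_vms": [...],
        "datastore": "..."} describing one multi-tier app."""
        frozen = []
        try:
            for vm in app["stateful_vms"]:
                freeze_vm(vm)
                frozen.append(vm)
            return snapshot_datastore(app["datastore"])
        finally:
            for vm in frozen:
                thaw_vm(vm)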

a. This tenant requires that 3 of the multi-tier apps fail over to one remote 
OpenStack site and the other 5 multi-tier apps fail over to a different remote 
site from the first.

b. This tenant performs a weekly non-disruptive test-bubble failover test. 
Real failover is not triggered. Instead, all the multi-tier app VMs that would 
boot upon failure are booted (from their latest snapshots on the secondary), 
but the VMs' virtual network environment on the secondary is isolated from 
external networking. Test bubbles at the two remote OpenStack sites may need 
to be connected via some VPN/tunnel or equivalent without manual admin 
activity.



* (4) [Tenant failover]

An OpenStack tenant has 40 VMs, relatively lightly loaded, used for 
development. The VMs do not contain VSS, qemu-ga, or standard tools (they may 
be running any Linux distro, some may be running Plan9, and the tenant may be 
doing Linux kernel development; that is, the VMs can be anything). A remote 
OpenStack deployment needs to exist so that in the event of loss of the 
primary OpenStack site, the tenant can continue development. In addition to 
volume replication as in (1), and subject to policies and SLAs, cold migration 
may be performed on a VM's volumes upon shutdown (or dismount), and tenant 
end-users can explicitly request replication of a volume that is in an 
application-consistent state (when they have quiesced it by VSS, dismount, or 
equivalent). 

a. Being down for a short period may be acceptable to this tenant. If all the 
hosts on the primary site are rebooted, for instance due to power failure, it 
is the operator's choice whether to fail over or not. If the operator chooses 
not to fail over, then upon reboot of the VMs at the primary site, any 
established replication should automatically be continued.



* (5) [Scale-out workload] 

A tenant has a Cassandra cluster (or Hadoop or a similar type of system) 
consisting of 75 VMs. Use is bursty. The system is used by a pharmaceutical 
company for design work. A lost week's work can be redone, but weekly 
replication is mandatory. The application itself may provide some form of 
built-in geo-replication. Some controller-type VMs may need to be replicated 
as in (1). Other VMs may partner with replica VMs for explicit application 
data replication. For weekly replication of Cassandra data, Cassandra 
user-level snapshots are made into replicated volumes attached to each 
Cassandra VM. Replication is periodic with respect to the last replication 
event; that is, only data changed since the last replication event is sent.
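
A sketch of what that weekly Cassandra step could look like, driven by the
kind of user-level scripts item (b) below mentions. nodetool snapshot and
rsync are the real tools being wrapped; the node names, the data-directory
layout, and the replicated-volume mount point are assumptions made for the
example.

    # Hypothetical weekly replication job for use case (5): take a
    # user-level Cassandra snapshot on every node, then copy it onto the
    # replicated volume attached to that node.

    import datetime
    import subprocess

    CASSANDRA_NODES = ["cass-%02d" % i for i in range(1, 76)]  # 75 example VMs
    DATA_DIR = "/var/lib/cassandra/data"   # assumed Cassandra data directory
    REPLICA_MOUNT = "/mnt/dr_replica"      # assumed mount of the replicated volume

    def weekly_snapshot():
        tag = "dr-" + datetime.date.today().isoformat()
        for node in CASSANDRA_NODES:
            # 1. Consistent user-level snapshot on this node (hard links,
            #    so it is fast and cheap).
            subprocess.check_call(
                ["ssh", node, "nodetool", "snapshot", "-t", tag])
            # 2. Copy this week's snapshot onto the attached replicated
            #    volume; the volume's own replication then ships only
            #    changed blocks off-site.
            copy_cmd = ("rsync -aR {data}/./*/*/snapshots/{tag} {dest}/"
                        .format(data=DATA_DIR, tag=tag, dest=REPLICA_MOUNT))
            subprocess.check_call(["ssh", node, copy_cmd])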

   a. The tenant requires use of a particular aggregated network link for 
replication.

   b. The tenant requires custom integration with the DR replication workflow 
to quiesce Cassandra via user-level commands and scripts developed by the 
end-user.

   c. Initial synchronization of replicated primary and secondary volumes 
need not be over a network link. Secondary volumes can be created initially 
from physical disks or backups physically moved to the secondary site.



* (6) [Degraded-mode Mission-critical single VM]

This single VM use case is similar to (1), but when a network partition occurs 
between the primary and secondary OpenStack sites, with both sites remaining 
up, the primary VM remains operational while the secondary replica VM also 
comes online. Both VMs operate in a mode that resembles replication with a 
momentary network fault, logging their would-be replication traffic for 
continuation when the network comes back. When network connectivity is 
reestablished, one site again becomes the primary and differences in the VM's 
volumes can optionally (as controlled by policy) be reconciled. (In a simple 
case, each site might have its own dedicated volume partition or attached 
volume with its latest state.)



* (7) [Self-contained application volume]

A Cinder volume contains a complete database application, including the 
database and all binaries and configuration files. Replication of the entire 
VM to which this volume is attached is not needed. The VM and its 
configuration can be recreated on demand at the remote site and attached to 
the replicated application volume. The DR system still needs to orchestrate 
the process and create or manage the required network environment. A simple DR 
strategy can be used in which the volume is quiesced on the primary, a volume 
snapshot is taken, the volume is unquiesced (enabling the VM to continue 
running), and a backup is then made of the snapshot. Backups can be 
transported by whatever means to the DR site, where the volume can be restored 
to its state at the time of the snapshot.
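
A minimal sketch of that strategy, driving the standard cinder CLI from
Python. The quiesce/unquiesce hooks, the ship_to_dr_site() step, and the
example volume id are placeholders; only the snapshot-create call is a real
command, and even its use here is just one way to take the snapshot.

    # Hypothetical minimal DR cycle for use case (7).  Only the "cinder
    # snapshot-create" call is a real command; the rest are placeholders
    # for application- and backend-specific steps.

    import subprocess

    APP_VOLUME_ID = "6f3c0e3a-0000-0000-0000-000000000000"   # example only

    def quiesce_application():
        raise NotImplementedError       # e.g. flush/lock the DB inside the VM

    def unquiesce_application():
        raise NotImplementedError

    def ship_to_dr_site(snapshot_id):
        raise NotImplementedError       # back up the snapshot, move it off-site

    def parse_snapshot_id(cli_output):
        """Pull the id field out of the CLI's table output (simplistic)."""
        for line in cli_output.decode().splitlines():
            cells = [c.strip() for c in line.strip("|").split("|")]
            if cells and cells[0] == "id":
                return cells[1]
        raise ValueError("snapshot id not found in cinder output")

    def dr_cycle():
        quiesce_application()
        try:
            # --force allows snapshotting a volume that is attached/in-use.
            out = subprocess.check_output(
                ["cinder", "snapshot-create", "--force", "True", APP_VOLUME_ID])
        finally:
            unquiesce_application()     # the VM keeps running from here on
        ship_to_dr_site(parse_snapshot_id(out))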



* (8) [Stateless]

No volumes or VMs need to be replicated, as VMs and their configuration can 
be recreated on demand using configuration tools, and application data is 
accessed over the wide-area network (NFS or object store). The DR process 
still has to orchestrate creating the VMs, running configuration tools to 
populate them, creating the network environment, and booting the VMs in the 
required order.



* (9) [Site Evacuation]

The holy grail: automatic planned migration of the workload and data from one 
cloud-scale datacenter to another (or to a set of others). In practice, this 
is likely to include admins in the loop, at both tenant scale and 
entire-datacenter scale. The entire cloud datacenter is expected to go offline 
for an extended period (the hurricane scenario).



-bruce

