Matt Benjamin alluded to this in another email on the info list; given the
state of our world it's a good idea to get the idea out to others. "The state
of our world" doesn't mean it's coming apart, it just means that we probably
aren't going to be working on this for the foreseeable future.

Dan Hyde and I were building a system at Michigan that was intended to allow
rapid disaster recovery for AFS, using a scheme analogous to SnapMirror
replication on NetApp filers and similar devices. This note is to provide a
quick overview of where
we were going, how far we got, and why.

Problem: at Michigan, losing an AFS file server can make some or all of the 
cell unusable (handwave, handwave on why). As the number of servers increases, 
the likelihood of this becomes higher and higher. We were looking for a way to 
minimize those losses, and glommed onto an unfinished project called 'shadow 
volumes' to do it.

Shadow volumes have lots of theoretical capabilities; we were pushing for one 
specific set of features in our implementation. Don't take our work as being 
representative of either what the initial developers intended or as the only 
possible use for it. In classic open source fashion, our development reflected 
scratching our own particular itches.

Credit where credit is due: Dan did all the heavy lifting on the code and a lot 
of the test operational deployment. And the original work was done by someone 
whose name escapes me right at this second; if time and energy permit I'll look 
that up and give that credit.

A shadow volume is a read-only remote clone of a primary volume. We had to 
create some terminology here, and 'primary' is what we called the real-time, 
in-use, r/w production volume. A remote clone closely resembles a read-only 
replica of a volume, but differs in several important respects.

First and foremost, it does not appear in the vldb. Thus there is no
possibility of the read-only copy coming into production. If it were public
like an r/o replica, it would generate all kinds of problems for the
day-to-day use of the volume. Our solution to this follows the original
developer's: the only way to prevent use of the r/o copy was to keep it out of
the vldb. Longer term there are better ways, but this did the least violence
to existing cells.

A shadow volume should retain a timestamp and name-or-id relationship with
the primary. This should enable something much like a release of a replicated
volume - incremental changes are quickly and easily propagated to the shadow.
We call that refreshing the shadow. As the shadow is not in the vldb, the
refresh has to be initiated by something external to the vldb/primary. That
code is complete and works. This was running on a nightly basis in our cell
with an acceptably small amount of overhead - not much more than the nightly
backup snapshots. Big kudos to Dan on this.
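
For the curious, the per-volume refresh boils down to a single vos call.
Here's a rough Python sketch; the subcommand name and flags reflect our
patched vos and I'm writing them from memory, and the host and volume names
are made up:

# Sketch: refresh (or create) one shadow by shelling out to vos.  The
# "shadow" subcommand and -incremental flag reflect our patched vos and
# are written from memory; host/volume names are invented.
import subprocess

def refresh_shadow(volume, from_server, from_part, to_server, to_part):
    subprocess.check_call(
        ["vos", "shadow", volume,
         "-fromserver", from_server, "-frompartition", from_part,
         "-toserver", to_server, "-topartition", to_part,
         "-incremental", "-localauth"])

if __name__ == "__main__":
    refresh_shadow("user.dhyde", "afs01.example.edu", "a",
                   "shadow01.example.edu", "a")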

Shadow volumes could be detected only on the server on which they reside.
Modifications were made to vos listvol for that purpose. A bit in the volume
header was selected to distinguish a shadow from a primary volume; I believe
that was the only modification made to the volume header file. This work is
also done.
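
To give a flavor of the listvol change, here's a sketch of how a script might
pick shadows out of the modified output. The 'Shadow' marker string is an
assumption about what our patched listvol prints, so treat it as illustrative
only:

# Sketch: find shadow volumes on a server by scraping the modified
# "vos listvol" output.  The "Shadow" marker is an assumption about
# what the patched listvol prints for the header bit.
import subprocess

def list_shadows(server, partition="a"):
    out = subprocess.check_output(
        ["vos", "listvol", server, partition, "-long", "-localauth"],
        text=True)
    shadows = []
    for block in out.split("\n\n"):           # one block per volume
        if "Shadow" in block:                 # hypothetical marker
            shadows.append(block.split()[0])  # first token is the name
    return shadows

if __name__ == "__main__":
    print(list_shadows("shadow01.example.edu"))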

A mechanism needs to be established such that a shadow volume can be promoted 
(our term) to a primary. This mechanism would involve at least two steps: 
flipping the shadow bit in the header file to indicate the volume is a primary, 
and updating the vldb to indicate the new location of the primary. This work
is incomplete; I don't have a feel for how much, if any, is done.
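
To make the intent concrete, promotion would look roughly like the sketch
below. There is no real 'vos promote' subcommand - it's a placeholder for the
unfinished bit-flipping step - and I'm guessing that the stock vos syncvldb
could handle the vldb half, since it rebuilds entries from what a server
actually holds; we never verified that end to end:

# Sketch: promote one shadow to primary.  "vos promote" is a made-up
# placeholder for the unfinished bit-flip; "vos syncvldb" is the stock
# command that rebuilds vldb entries from what a server holds, which is
# our guess at step two -- plausible, but unverified.
import subprocess

def promote_shadow(volume, server, partition="a"):
    # Step 1: clear the shadow bit in the volume header (placeholder).
    subprocess.check_call(["vos", "promote", volume,
                           "-server", server, "-partition", partition,
                           "-localauth"])
    # Step 2: point the vldb at the new home of the primary.
    subprocess.check_call(["vos", "syncvldb", "-server", server,
                           "-partition", partition, "-volume", volume,
                           "-localauth"])

if __name__ == "__main__":
    promote_shadow("user.dhyde", "shadow01.example.edu")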

With these features, we could meet the minimum bar for our usage. We could, in 
theory, disastrously lose an AFS server, promote the shadows, and be back 
online in minutes. There would be data lossage for any changes that occurred
between the last refresh and the promotion, but this was judged preferable to
having the cell down or non-functional for hours or even days.

In our initial implementation, we were building AFS servers in pairs - each
primary server paired with a shadow server. Each server in a pair was
intended for only one purpose -
either all primary volumes, or all shadow volumes. This isn't the only way to 
do it, but we selected this method for a couple of reasons:

* It eased the tracking of where shadow volumes were, and enabled us to
easily find shadow volumes that might no longer be needed on a given shadow
server.
* It very much reflects the problem we're trying to solve: disastrous loss of
either (a) a file server or (b) an entire data center. A quick ability to tell
a server 'promote everything' (sketched just after this list) made for quick
and accurate response in the face of not having the shadow data in the vldb.
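
The 'promote everything' knob is little more than gluing the two previous
sketches together; again, 'vos promote' is a made-up placeholder and 'Shadow'
is an assumed marker string:

# Sketch: the "promote everything" knob for a shadow server whose
# paired primary server was just lost.  "vos promote" is still a
# placeholder and "Shadow" is still an assumed marker string.
import subprocess

def vos(*args):
    return subprocess.check_output(
        ["vos"] + list(args) + ["-localauth"], text=True)

def promote_everything(shadow_server, partition="a"):
    listing = vos("listvol", shadow_server, partition, "-long")
    for block in listing.split("\n\n"):
        if "Shadow" not in block:              # hypothetical marker
            continue
        vol = block.split()[0]
        vos("promote", vol, "-server", shadow_server,
            "-partition", partition)           # placeholder subcommand
        vos("syncvldb", "-server", shadow_server,
            "-partition", partition, "-volume", vol)
        print("promoted", vol)

if __name__ == "__main__":
    promote_everything("shadow01.example.edu")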

To support this process, every night (or at whatever interval you choose) the
shadow servers would examine the primary volumes on their paired server, and
would create or refresh the shadows as needed. We intended to update our
provisioning process for volumes such that shadows would automatically be
created when a primary was created or moved, but since the shadow servers
caught any missing volumes automatically, it was kind of low on the list.
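
In script form, the nightly sweep on a shadow server amounted to something
like this (a sketch only; host names are invented and, as before, the vos
shadow flags are from memory):

# Sketch: nightly sweep on a shadow server -- create or refresh a
# shadow for every RW volume on the paired primary server.  Host names
# are invented; whether -incremental falls back to a full copy for a
# brand-new shadow is an assumption about the vos patch.
import subprocess

PRIMARY = "afs01.example.edu"
SHADOW = "shadow01.example.edu"
PART = "a"

def primary_rw_volumes(server, partition):
    out = subprocess.check_output(
        ["vos", "listvol", server, partition, "-localauth"], text=True)
    vols = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "RW":
            vols.append(fields[0])
    return vols

def sweep():
    for vol in primary_rw_volumes(PRIMARY, PART):
        subprocess.check_call(
            ["vos", "shadow", vol,
             "-fromserver", PRIMARY, "-frompartition", PART,
             "-toserver", SHADOW, "-topartition", PART,
             "-incremental", "-localauth"])

if __name__ == "__main__":
    sweep()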

Other things one could do with shadows:

One could use shadows and their clones as part of a file restore system.
That's nice, but rather a pain in many ways. It's also really an attempt to
work around the limitation of only having 7 clone slots available. Having a
significantly larger number of clones would be a much better solution, but
that's outside the scope of this project.

Things envisioned but not yet followed through to an actual design:

* a vldb-like solution such that shadow(s) of a given primary could be
identified easily and moved/updated appropriately. In the best of all worlds,
this would be a part of the vldb, but that's a lot to wish for
* volume-sensitive and shadow-sensitive decisions on refresh frequency. One
might refresh critical data volumes quite often, less critical ones rarely or
not at all. One might refresh on-site shadows frequently, off-site ones daily
* remote shadows become your long-term backup system. This would require
several features, most critically:
** the ability to have clones of shadows, one clone per daily backup, say.
Note this requires that refreshing a shadow also manage those clones in some
flexible way
** the ability to promote a shadow to a different name. This enables the
shadow and its clones to be made visible without taking the production volume
off-line.
* clones (in particular, .backup) of a primary should be refreshable to a
shadow, i.e., specified clones of the primary could be refreshed to the shadow
* some way of mediating between incompatible operations, e.g., have refresh
operations either queue or abort cleanly if they would interfere with other
activities like volume moves, backupsys, etc. (a toy sketch of the
abort-cleanly half follows this list)
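
Purely to illustrate that last point, the crudest form of mediation is a
per-volume lock that a refresh grabs and gives up on cleanly if some other
operation already holds it. A toy sketch, not a design:

# Toy sketch of per-volume mediation: take an exclusive lock file per
# volume; if another operation (move, backupsys, another refresh)
# already holds it, abort cleanly instead of colliding.  The lock
# directory is invented; a real design would want queueing as well.
import fcntl
import os

LOCKDIR = "/var/lock/afs-volops"               # invented path

def try_volume_lock(volume):
    os.makedirs(LOCKDIR, exist_ok=True)
    fd = os.open(os.path.join(LOCKDIR, volume), os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd                              # caller now holds the lock
    except BlockingIOError:
        os.close(fd)
        return None                            # someone else is busy

if __name__ == "__main__":
    fd = try_volume_lock("user.dhyde")
    if fd is None:
        print("volume busy; skipping refresh")
    else:
        print("lock held; safe to refresh")
        os.close(fd)                           # closing releases the lock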

Some open questions:

It was clear we were talking about volume families - a primary, its clones,
its shadows, their clones, etc, etc. Should you be able to have shadows of
shadows? We think so; refreshing multiple shadows of a given volume shouldn't
require hitting the primary multiple times, nor doing all those refreshes in
lockstep. We need to establish a sort of taxonomy of volumes with well-defined
relationships. Dan and I came up with a lot of ideas, but are very aware that
we were reasoning in the dark. Other sites might well have other needs that
would affect this.
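
As a strawman for that taxonomy, think of something as simple as a tree of
volumes, each node knowing what kind it is and what it was derived from. The
terms in the sketch below are ours, nothing blessed:

# Strawman taxonomy of a "volume family": a primary, its clones, its
# shadows, their clones, shadows of shadows, and so on.  None of this
# is blessed terminology; it is just a tree.
from dataclasses import dataclass, field

@dataclass
class Vol:
    name: str
    kind: str                        # "primary", "clone", or "shadow"
    server: str
    children: list = field(default_factory=list)

    def derive(self, kind, server, suffix=""):
        child = Vol(self.name + suffix, kind, server)
        self.children.append(child)
        return child

# user.dhyde on afs01, with its nightly .backup clone, a shadow on
# shadow01, and a clone of that shadow kept for restores.
primary = Vol("user.dhyde", "primary", "afs01.example.edu")
primary.derive("clone", "afs01.example.edu", ".backup")
shadow = primary.derive("shadow", "shadow01.example.edu")
shadow.derive("clone", "shadow01.example.edu", ".mon")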

I think we were sliding towards a transparent, upward-compatible replacement
of the vldb as well. Based purely on how I imagine the vldb to work :-), it
should be possible to add shadow data to it and define some additional rpcs.
Users of the old rpcs would only get the data that was in the 'legacy' vldb;
users of the new rpcs would get the shadow data as well. That's a door folks
may not want opened yet, but it seems a better choice than bolting a separate
shadow-oriented vldb onto the side.
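
To make the compatibility idea concrete (with the standing caveat that I'm
imagining how the vldb works), picture the entry type growing a list of
shadow sites, with the legacy lookup returning a view that simply omits them.
The sketch below illustrates the idea only; the field names are invented, not
real vldb structures or rpcs:

# Illustration only: a vldb-ish entry grown to carry shadow sites,
# where a legacy-style lookup strips them out and a new, shadow-aware
# lookup returns everything.  Field names are invented.
from dataclasses import dataclass, field

@dataclass
class Site:
    server: str
    partition: str
    role: str                        # "rw", "ro", or "shadow"

@dataclass
class VolumeEntry:
    name: str
    sites: list = field(default_factory=list)

    def legacy_view(self):
        """What an old rpc would return: no shadow sites."""
        return [s for s in self.sites if s.role != "shadow"]

    def full_view(self):
        """What a new, shadow-aware rpc would return."""
        return list(self.sites)

entry = VolumeEntry("user.dhyde", [
    Site("afs01.example.edu", "a", "rw"),
    Site("shadow01.example.edu", "a", "shadow"),
])
assert len(entry.legacy_view()) == 1 and len(entry.full_view()) == 2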


So that's where we are. I believe our latest shadow software is built against
1.4.11, but I could be wrong. If folks are interested, I'd be happy to chat
with Dan and we'll release the patches to interested parties.

If folks think this is worth writing up in the AFSLore wiki as a partial
project, I'd be glad to take this note and shovel it in with appropriate
formatting.

Steve
