> "rick" == rick <[EMAIL PROTECTED]> writes:
rick> I propose that the database server policy be changed so that a
rick> system administrator can optionally configure a server to
rick> allow read access to server databases when communication with
rick> the sync site is broken.
> "rens" = From: Rens Troost <[EMAIL PROTECTED]> writes:
rens> The problem is that you cannot use a cell's servers if you cannot
rens> contact a majority of them. My site is unusual in that there are three
rens> sites that should really be in the same cell, for software
rens> distribution purposes, but that cannot be for reliability reasons.
rens>
rens> If there was some mechanism for intercellular volume synchronization,
rens> then I'd make three cells, and synch R/O volumes between them. My
rens> users need access to their R/O volumes even if the WAN is destroyed by
rens> hungry carpenter ants. I am not deploying AFS at my site, in
rens> consequence, until a solution to this problem is found.
The usual problem with "inter-cell" arrangements is that
they cross administrative boundaries, so there are a bunch
of trust and security issues.
In this case, it very much sounds like you don't have that.
So there are some short-cuts you can take that might meet
your needs.
For instance:
Split things up into N cells, one per site. Make sure all N cells have
the same key for AFS. By doing this, they can all be "one big happy
family" from the fileserver standpoint.
Volumes & users would ordinarily be completely independent between cells.
That is, each user would have their own ID & key, and there would be
no guarantee that users have the same ID, key, or password
in each cell. In order to make this kludge work, you will need
to guarantee that one consistent administrative procedure is
followed to create users & groups, and that the procedure
is applied identically in all N cells.
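One way to enforce that procedure is to never run "pts createuser" by
hand, but always through a wrapper that creates the user with the same
name and the same numeric id in every cell. A minimal sketch (the cell
names are placeholders, and you need admin tokens in each cell for this
to work):

/* mkuser.c - create a user with the same name & id in all N cells,
 * so ACLs on shared volumes mean the same thing everywhere. */
#include <stdio.h>
#include <stdlib.h>

static const char *cells[] = {
    "site1.example.com", "site2.example.com", "site3.example.com"
};

int main(int argc, char **argv)
{
    char cmd[256];
    int i, n = sizeof(cells) / sizeof(cells[0]);

    if (argc != 3) {
        fprintf(stderr, "usage: %s <username> <numeric-id>\n", argv[0]);
        return 1;
    }
    for (i = 0; i < n; i++) {
        /* same name, same id, each cell in turn */
        snprintf(cmd, sizeof(cmd), "pts createuser -name %s -id %s -cell %s",
                 argv[1], argv[2], cells[i]);
        printf("%s\n", cmd);
        if (system(cmd) != 0)
            fprintf(stderr, "failed in cell %s\n", cells[i]);
    }
    return 0;
}

The same trick applies to group creation.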
Having guaranteed that the AFS key is the same, and that
user IDs are the same, you can now *share* volumes between cells.
In order to do that, after you create the volume in one cell,
you will need to add "shared" volumes to the VLDB on the other N-1
cells. If Transarc distributes "vlclient" you might be able to use
it directly; otherwise, you might have to write a small C program
that links against libvldb to make the actual RPC's to the vlserver to
cause this magic to happen. The C program would be making ubik
calls against the other N-1 cells' db servers to invoke VL_CreateEntry,
VL_ListEntry, or other routines (sketched below). Once you've done this,
you will also need to exercise extreme caution in using commands like
"vos remsite", "vos syncserv", or "vos syncvldb".
"Michael Niksch" <[EMAIL PROTECTED]>'s response came in as
I was finishing this reply - what he proposes is more or less
a variation of what I'm describing here, with similar sorts of risks.
Another possibility might work a lot better but would be a lot more work.
There are probably only 3 services you really care about on the DB servers:
ka, vl, & pt - as long as something responds to the appropriate RPC
with the appropriate data at each site, you probably don't care what
it's *really* doing behind the scenes. So, another possibility is this:
Select one site and make it the "master" cell. It could probably
run stock AFS server binaries.
At the other "slave" sites, do magic. Configure machine to act as
"fake" database servers. They would contain special server binaries
that would respond to the appropriate RPC's, and return data to the
clients, so the clients would be happy. But the server binaries
themselves would not be the *real* server binaries, but clever
impostors posing as the real thing. Since you control them, they
would *also* know the key of AFS, thus permitting the deception.
But what they would *really* do is this: first make the call on
the master ("real") cell. If it succeeds, hand the results
back. If the call fails, then "do the call" locally: look up
the results in a local copy of the database, and hand them back.
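The core of each impostor is just that dispatch. A sketch of it for one
lookup-style call, with forward_to_master() and local_lookup() as
hypothetical stand-ins for the real Rx calls and database reads:

/* impostor dispatch: ask the real cell first, fall back to the local
 * copy of the database if the WAN is down. */
#include <stdio.h>

struct reply { char data[256]; };

/* hypothetical: make the same RPC against the real (master) cell */
static int forward_to_master(const char *request, struct reply *out)
{
    (void)request; (void)out;
    return -1;                  /* pretend the WAN is down */
}

/* hypothetical: answer out of the locally cached copy of the database */
static int local_lookup(const char *request, struct reply *out)
{
    snprintf(out->data, sizeof(out->data), "local answer for %s", request);
    return 0;
}

static int handle_request(const char *request, struct reply *out)
{
    if (forward_to_master(request, out) == 0)
        return 0;                       /* master reachable: use its answer */
    return local_lookup(request, out);  /* WAN down: answer locally */
}

int main(void)
{
    struct reply r;

    if (handle_request("user.rens", &r) == 0)
        printf("%s\n", r.data);
    return 0;
}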
The "local" copy of the database could be populated in one of
2 ways:
(1) keep a "cache" of results from the master cell
on the local machine. This might not work real well,
but it might be pretty quick & easy to code.
(2) hook into ubik and "steal" a copy of the database
"every so often". For your needs, a "daily" copy
might be good enough, or you might prefer to figure
out a more proactive approach.
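For option (1), the cache really can be as dumb as a fixed-size table
of recent answers with a lifetime on each slot. A sketch (the sizes,
the string-keyed lookup, and the 24-hour lifetime are all arbitrary
choices for illustration):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define NSLOTS   512
#define LIFETIME (24 * 60 * 60)         /* seconds an answer stays valid */

struct slot {
    char name[64];
    char answer[256];
    time_t when;                        /* 0 = slot unused */
};

static struct slot cache[NSLOTS];

static struct slot *cache_find(const char *name)
{
    int i;

    for (i = 0; i < NSLOTS; i++)
        if (cache[i].when != 0 &&
            strcmp(cache[i].name, name) == 0 &&
            time(NULL) - cache[i].when < LIFETIME)
            return &cache[i];
    return NULL;
}

static void cache_store(const char *name, const char *answer)
{
    int i, victim = 0;

    /* dumb replacement policy: overwrite the oldest slot */
    for (i = 1; i < NSLOTS; i++)
        if (cache[i].when < cache[victim].when)
            victim = i;
    strncpy(cache[victim].name, name, sizeof(cache[victim].name) - 1);
    strncpy(cache[victim].answer, answer, sizeof(cache[victim].answer) - 1);
    cache[victim].when = time(NULL);
}

int main(void)
{
    struct slot *s;

    cache_store("user.rens", "id 1042");
    s = cache_find("user.rens");
    printf("%s\n", s ? s->answer : "not cached");
    return 0;
}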
Thinking about it, there are some clever ways to "steal" a copy:
If the master cell has the low IP address, and you don't have too many
slave sites in relation to the master site, and you have enough
master-site db servers, and you are convinced that the master-site
machines are reliable enough, you might configure an "extra" db server
at each site that would be configured as a db server in the master
cell, but that would not be listed on master-site clients. The
"extra" db server would naturally acquire a copy of the database,
which could then be propagated to the extra "bogus" servers at each
slave site.
Or, with a bit of work, this approach could be inverted and used at
each slave site. Here's a somewhat elaborate 3-machine approach:
slave machine #1
    acquires a "copy" of the DB from the
    master site, either as a straight
    file copy (ftp) or perhaps using
    the "extra" master impostor approach.
    It responds to ubik but always nominates
    itself as the sync site; giving it
    the low IP address at the slave site
    will suffice for this.
slave machine #2
    runs the real Transarc binaries, and
    lists slave machines #1 & #2 as the
    2 db servers. The real Transarc
    binary would effectively act as a
    read-only interpreter of the database file.
slave machine #3
    this is the only one listed on slave
    site client machines. It doesn't
    know about ubik, but does know about
    the RPC's - it always tries the request
    on slave machine #2. If that fails,
    it then retries it on the real master
    site using ubik. An optimization might
    be to remember which opcodes fail on the
    slave because it is not the sync site, and
    go directly to ubik for those (sketched below).
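The opcode memory can be as simple as a byte per opcode. A sketch, with
the two transports stubbed out and the opcode range picked arbitrarily:

/* slave machine #3: route each opcode, remembering the ones that the
 * local read-only server (#2) refuses because it isn't the sync site. */
#define MAXOPCODE 1024

static char needs_master[MAXOPCODE];    /* 1 = known to fail locally */

/* hypothetical transports */
static int send_to_slave2(int opcode) { (void)opcode; return -1; }
static int send_to_master(int opcode) { (void)opcode; return 0; }

static int route(int opcode)
{
    if (opcode < 0 || opcode >= MAXOPCODE)
        return send_to_master(opcode);
    if (!needs_master[opcode]) {
        if (send_to_slave2(opcode) == 0)
            return 0;
        needs_master[opcode] = 1;       /* don't bother asking #2 again */
    }
    return send_to_master(opcode);      /* ubik call to the real cell */
}

int main(void)
{
    return route(507);                  /* arbitrary opcode number */
}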
Any of these ideas would require at least some C programming,
and quite a bit of familiarity with RPC's, Kerberos, threads,
RX, ubik, and other AFS services. With a lot of work,
this would probably suffice - but having AFS source would
certainly be a good short-cut, and would probably save months
of effort. Even without full source, it might still be possible
to wrangle some useful bits out of Transarc. The *.xg grammars
from AFS would help a lot here; they would give you a useful
machine-readable hook into the real RPC's used and save quite
a bit of work.
There is also, of course, the "chicken" approach: a cron script
that copies stuff over from a directory tree in one cell to
another cell nightly - which
requires much less knowledge of the system and is likely
to prove portable to DCE or other architectures.
A variation of that would be to do a "vos dump" / "vos restore",
and perhaps trigger it based on the lastmod time of the readonly.
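A sketch of that variation, again in C for a cron job, with everything
(cell names, volume, server, partition, and the exact vos flags - check
how your vos restore handles an existing volume) treated as placeholders:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Dump the volume from the source cell, then restore it into the
     * destination cell.  A fancier version would first compare the
     * readonly's last-update times (vos examine) in the two cells and
     * skip the copy when the destination is already current. */
    const char *cmd =
        "vos dump -id rep.software.readonly -time 0 -file /tmp/rep.dump"
        " -cell site1.example.com && "
        "vos restore -server fs1.site2.example.com -partition a"
        " -name rep.software -file /tmp/rep.dump -cell site2.example.com";

    if (system(cmd) != 0) {
        fprintf(stderr, "volume copy failed\n");
        return 1;
    }
    return 0;
}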
-Marcus Watts
UM ITD RS Umich Systems Group