>>>>> "Marcus" == Marcus Watts <[EMAIL PROTECTED]> writes:

Marcus> Your best bet may be to think of this as a "per-volume" issue,
Marcus> instead of a "per-file" issue.  Instead of having a notion of
Marcus> a directory tree you replicate, you might instead think of
Marcus> having a list of volumes you replicate.

This is precisely what we do here, to distribute data from a single
central volume to as many as 30+ cells globally.  The system which
does this is a very significant engineering effort: it supports
incremental distribution in parallel, and so many other features that
I'll stop here and merely suggest that you attend my Decorum '97 talk
on the very same subject. ;-)

Marcus> To do this, you could do a "vos dump" and "vos restore" to
Marcus> copy volumes over.  You can check the volume modification
Marcus> timestamp to see when it's needed.  It's also possible to do
Marcus> incremental vos dumps, which might save on transportation
Marcus> costs.

There are several potential problems with incremental distribution,
only some of which we have managed to solve.  Bear in mind that
Transarc never designed AFS to support this type of distribution,
even though the fact that you can dump/restore incrementally would
seem to imply it is easy.
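
For reference, the basic mechanics Marcus describes look roughly like
the following.  This is only a minimal sketch: the volume name, cell
names, paths, and timestamp are made up, and you should check the vos
flags against your own AFS release.

  # Full dump of the source volume to a flat file
  vos dump -id proj.dist -file /tmp/proj.dist.full -cell source.cell.com

  # Later: incremental dump of everything changed since the given time
  vos dump -id proj.dist -time "04/01/1997 00:00" -file /tmp/proj.dist.incr

  # Restore incrementally into the target cell, then release the RO clones
  vos restore -server afs1.remote.cell.com -partition /vicepa \
      -name proj.dist -file /tmp/proj.dist.incr \
      -overwrite incremental -cell remote.cell.com
  vos release proj.dist -cell remote.cell.com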

First of all, the current release of the volserver does something
seemingly innocuous whenever the RO clone volume is re-created.  AFS
uses a version number on each vnode to determine whether or not the
data in the client AFS cache is up to date.  One might have assumed
that data such as the modification time of files or directories was
used, but this is incorrect.  The version numbers can be seen
explicitly by using the volinfo command.

For example, if you look at root.afs in our environment:

volinfo -vol 536870915 -part /vicepa -vnode
.... stuff deleted ....
Large vnodes (directories)
         0 Vnode 1.1.113 cloned: 1, length: 2048 linkCount: 3 parent: 0
       256 Vnode 3.3755.14 cloned: 1, length: 2048 linkCount: 3 parent: 1
       512 Vnode 5.3760.12 cloned: 1, length: 2048 linkCount: 2 parent: 3

Small vnodes(files, symbolic links)
.... more stuff deleted ....

Note the three numbers after each Vnode in the directory list,
e.g. 1.1.113.  The third number is the version number, in this case
113.  Whenever the RO clone is recreated (assuming you use clones to
save disk space), i.e. any time *any* vos release command is run, the
volserver increments these numbers.

Again, this would seem harmless, but it has a couple of side effects.
First, even if the directory contents have NOT changed, clients will
refetch the directory entries anyway, since the version number has
changed.

Second is the problem which affects incremental distribution, and
while this may seem intuitive, trust me, it is not.  Working *with*
Transarc to understand the effects of this took us over a year and a
half.  That is itself a long story...

When an incremental dump of a volume is taken, it includes *all* the
directory entries in the source volume, including the version numbers
of the vnodes.  When this is incrementally restored to a target
volume, the version numbers are also restored.  Now vos release that
volume in the remote cell, and the version numbers increment.
Suppose that between two subsequent dump/restore distributions to
that volume, maintenance work is done in the remote cell which
requires another vos release.  The version numbers increment again.

Now, AFS clients in the remote cell will have cached version numbers
for directories which may be out of sync with the original source
volume.  If you touch a file in the source volume and redistribute
it, it is entirely possible that the clients will see a NET CHANGE of
0 in the vnode version numbers for some directories, and thus they
will assume the data they have in their cache is completely correct,
and NOT update it.
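
To make that concrete, here is a hypothetical sequence, using the
directory version 113 from the volinfo output above and assuming for
illustration that each release bumps the number by exactly one:

  1. An incremental restore puts version 113 into the remote RW
     volume; the vos release there bumps the RO clone the clients
     read to 114.  Remote clients cache 114.
  2. Maintenance in the remote cell forces another vos release; the
     RO clone is now at 115, and clients that refetch cache 115.
  3. Back in the source cell, you touch a file and vos release, so
     the source directory is now at 114.  The next incremental
     dump/restore carries 114 to the remote RW volume, and the vos
     release there bumps the remote RO clone to 115.
  4. Remote clients compare 115 against their cached 115, see a net
     change of 0, and keep serving the stale directory contents.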

This can be resolved ONLY by a vos release -f of the affected volume,
or by flushing the directory from the client caches.
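
In command terms, the two workarounds look something like the
following.  The volume and paths are placeholders, and you should
check the exact flag spelling for your AFS release (some releases
spell the force flag -force rather than -f):

  # Force a full, non-incremental release of the affected volume
  vos release proj.dist -f -cell remote.cell.com

  # Or, on each affected client, flush the directory or the whole volume
  fs flush /afs/remote.cell.com/proj/some/dir
  fs flushvolume /afs/remote.cell.com/proj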

This is *really* subtle and hard to understand (if you understood the
above after one reading, send me your resume).  The effect is an
apparently stale AFS cache on many clients, but in fact the AFS
client code is working perfectly.  The problem is that incremental
vos dump/restore can play games with the vnode version numbers.

We solved this by getting Transarc to write a simple tool which can
modify a vos dump by writing the current 32-bit time into the version
number field.  Since our distribution mechanism dumps to flat files
before performing the vos restores (the restores can then be done in
parallel off of the same file), this was easy to implement.
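
The overall pipeline then looks roughly like this.  The fix-up tool
below (call it fixdumpversions) is a made-up name standing in for the
Transarc-supplied tool just mentioned, and the volume, cells, and
paths are placeholders:

  # In the source cell: incremental dump to a flat file
  vos dump -id proj.dist -time "04/01/1997 00:00" -file /dist/proj.dist.incr

  # Rewrite the vnode version numbers in the dump with the current time
  # (fixdumpversions is a hypothetical name for the fix-up tool)
  fixdumpversions /dist/proj.dist.incr

  # Restore the same file into each remote cell in parallel, then release
  vos restore -server afs1.cell1.com -partition /vicepa -name proj.dist \
      -file /dist/proj.dist.incr -overwrite incremental -cell cell1.com &
  vos restore -server afs1.cell2.com -partition /vicepa -name proj.dist \
      -file /dist/proj.dist.incr -overwrite incremental -cell cell2.com &
  wait
  vos release proj.dist -cell cell1.com
  vos release proj.dist -cell cell2.com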

VERY hard to understand and debug.

Moral of this very long story: developing an incremental volume
distribution mechanism is non-trivial, and can be quite complex.

Come listen to my Decorum '97 talk for more details.

W. Phillip Moore                                        Phone: (212)-762-2433
Information Technology Department                         FAX: (212)-762-1009
Morgan Stanley and Co.                                     E-mail: [EMAIL PROTECTED]
750 7th Ave, NY, NY 10019

        "Grant me the serenity to accept the things I cannot change, the
         courage to change the things I can, and the wisdom to hide the
         bodies of the people that I had to kill because they pissed me
         off."
                        -- Anonymous

        "Every normal man must be tempted at times to spit on his
         hands, hoist the black flag, and begin slitting throats."
                        -- H.L. Mencken
