Re: MR-AFS...

Ken Hornstein Wed, 15 Apr 1998 20:20:49 +0200 (MET DST)
>Yes, MR-AFS is still alive and is heavily used at some sites:
>       
>       My site: Max-Planck-Institut fuer Plasmaphysik
>       Universitaet Koeln
>       Universitaet Magdeburg
>       Technische Universitaet Chemnitz
>       Navy Research Center, Washington DC

Well, since we're on that list, I suppose I'd better comment :-)

We've been "rolling out" MR-AFS for the last couple of years.  This
effort has stalled a number of times for various reasons (sometimes
lack of manpower, interest, or technical obstacles).  We started
originally with the PSC code a long time ago, but switched to Hartmut's
version after it was obvious that the PSC code was dead.

I will honestly say that MR-AFS has been a long and painful road for
us.  We've been sticking with it mainly because we're a big AFS shop,
and it was the only game in town.  A lot of things have contributed to
these problems, and it has taken a while to understand some of the
interactions between MR-AFS and the underlying archival system.

However, to give credit where credit is due, a large percentage of
these problems have not been due to MR-AFS itself.  MR-AFS makes
an assumption that your archival system works ... and it so happens
that we bought an EMASS tower which turned out to be a giant lemon.
We've also had a number of network infrastructure problems which
contribute to problems with MR-AFS and the EMASS system (the
EMASS system runs on _three_ separate computers).

To give you an example of the kind of problems we were having: for a
very long time, we were never able to move volumes between
fileservers in our MR-AFS cell.  It would fail during the volume clone
operation after a number of hours.

It turns out the root of this problem is that MR-AFS uses the Unix mode
bits on the Unix filesystem residencies to implement the AFS reference
counts.  So when you need to increment or decrement the reference count
you need to do a chown().  This isn't normally a problem ... but in our
EMASS system, every chown() system call takes half a second.  It turns
out that this is because the file database stores the Unix mode bits,
and doing a chown requires a database update.  This would eventually
make the volume server not able to keep up with requests, and even if
it did, we never had our machines and network stay up for the estimated
60+ hours it would take to move some of our larger volumes.  After
talking with EMASS, we were able to get them to remove this "feature"
and volume moves now are possible (and happen in a more reasonable
amount of time).
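
As a rough back-of-the-envelope sketch of why that half-second matters,
the snippet below works out how many reference-count updates a 60-hour
move would imply.  It assumes (my simplification, not something measured)
that essentially all of the move time is spent waiting on chown(); the
1 ms "fast" latency is a purely hypothetical comparison value.

```python
SECONDS_PER_HOUR = 3600


def implied_calls(total_hours, seconds_per_call):
    """Number of calls that fit in total_hours at a fixed per-call latency."""
    return total_hours * SECONDS_PER_HOUR / seconds_per_call


def move_hours(calls, seconds_per_call):
    """Hours needed to issue `calls` calls at a fixed per-call latency."""
    return calls * seconds_per_call / SECONDS_PER_HOUR


if __name__ == "__main__":
    # Figures from the text: ~60 hours estimated, ~0.5 s per chown().
    calls = implied_calls(60, 0.5)
    print(f"implied chown() calls: {calls:.0f}")

    # Hypothetical latency of 1 ms per call, for comparison only.
    print(f"hours at 1 ms per call: {move_hours(calls, 0.001):.2f}")
```

Under that assumption a 60-hour move corresponds to roughly 432,000
calls, so anything that makes each call cheaper pays off very quickly.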

So, was this the fault of MR-AFS?  Not really, but it took me a while
to understand what was going on.

However, working with Hartmut has been a great help; he's fixed a
number of long-standing problems we've had with MR-AFS, and has been a
great resource in general with respect to MR-AFS and AFS.  He's also
added to MR-AFS a number of more general features that I think Transarc
would be wise to look at (such as support for non-voting Ubik
"clones").  With his help, we have finally gotten MR-AFS to a point
where it's something that I think is actually useful.

I would encourage everyone to look at it and try it out.  It _is_
something worthwhile, and I think it even offers some advantages over
the current DFS HSM offering (but I'm not completely familiar with the
DFS HSM, so don't quote me on that).  But definitely get Hartmut's
version; you don't want to waste your time with the bitrotted PSC code
(if you can even get it from them anymore -- I got the impression that
they aren't distributing it nowadays).

--Ken