This memo is a description of the sequence of events in a recent
AFS server failure at our site.  Our experience, I believe, uncovers
a bug, or a missing feature, in AFS 3.2 that limits the robustness
AFS is supposed to achieve through replication.  Since Transarc
widely advertises filesystem reliability through replication, I would
like a technical explanation: if these features are easy to implement,
why have they not yet been provided, and if they are difficult to
implement, why is that so?

The behavior I am referring to is the manner in which AFS clients select
a list of AFS servers for file service.  This preference list can
be displayed with "fs getserverprefs".  From experiments I have learned
that the client builds its list of servers according to network topology
ONLY (an example appears after the list below).  The client does not:
        1) test whether a server is up before adding it to the
           preference list, OR
        2) order the list according to any client-side configuration
           file such as CellServDB
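
As a hypothetical illustration (host names and ranks are invented, and
the parenthetical notes are mine), the output on a client looks
something like this; a lower rank means the server is preferred:

        % fs getserverprefs
        rs1.example.com     20007     (down, yet most preferred)
        rs2.example.com     20011     (down, yet second choice)
        sun1.example.com    30002     (up, yet last choice)

The ranks reflect only network distance; nothing in the list reflects
whether a server is actually reachable.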

"fs setserverprefs" allows the administrator to rank the server preferences
once the afsd comes up and has established the base server preference list.
The is no client side tool to EXCLUDE A SERVER from the preference list
either before afsd comes up and sets up the base preference list OR
after afsd has selected the base preference list.  A server can be excluded
>From the preference list; however, in our case it required rebooting
BOTH the client and server, AND removing the server from the VLDB with
subsequent syncing.
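
The closest approximation, once afsd is already running, is to push the
dead servers' ranks to a very large value so they become least preferred
(host names and rank values here are hypothetical):

        # Demote the dead file servers to the bottom of the list
        % fs setserverprefs -servers rs1.example.com 60000 rs2.example.com 60000

        # Confirm the new ranking
        % fs getserverprefs

Even this only re-orders the list after the fact; the dead servers remain
in it, and, as we saw, a reboot of the client brings the topology-based
ranking right back.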

Here is what happened to us:  

Our site has 4 AFS servers, 2 RS6000s and 2 SUNs.  The two RS6000
servers went down with disk failures, but the two SUNs were functional -
one of the SUNs was the master sync site and housed the read-write
volumes, so one would have thought that we could have kept base AFS
service operable for most of the clients by redirecting the clients to
use the SUN servers.  The problem arose on the AFS clients: the server
preference list for many of the clients was rs6000 #1, rs6000 #2,
then the SUN sync site.  The rs6000s housed read-only volumes.
The clients were trying to reach rs#1, then rs#2 and then timing out.
When we rebooted the clients, afsd would try to reach rs#1, then rs#2 and 
then time out. AFS was unavailable to most of our clients even though the
base AFS cell was functional.  In order to recover the clients, our site
had to do the following (a rough sketch of the commands involved appears
after the list):
        1) remove rs#1 and rs#2 from the server list
        2) reboot the sync server
        3) remove rs#1 and rs#2 from the VLDB and sync the VLDBs (takes
           a long time!)
        4) reboot ALL the clients
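
For the curious, step 3 amounts to something like the following (host,
partition, and volume names here are hypothetical, and the remsite
commands must be repeated for every replicated volume with a read-only
site on the dead servers):

        # Remove the dead servers' read-only site entries from the VLDB
        % vos remsite rs1.example.com /vicepa root.cell
        % vos remsite rs2.example.com /vicepa root.cell

        # Re-synchronize the VLDB against the surviving file server
        % vos syncvldb sun1.example.com
        % vos syncserv sun1.example.com

Multiplied across every replicated volume in the cell, this is what made
step 3 take so long.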

This was a lot of downtime for the AFS clients and a major
interruption in file service!  Our site depends on AFS for home
directory service and for major infrastructure services such as
authentication to our archive server.  I'd like to know why AFS clients
don't have better control over the servers they select for file service,
why that file server can't be selected dynamically, or, at the very
least, why the timeout value can't be set dynamically so the client
won't time out before it reaches an available AFS file server.


Nancy Yeager
Project Leader, Distributed File and Storage Servers
National Center for Supercomputing Applications
University of Illinois


