Actually, things are worse than that: publicly posted CellServDB
information (even Transarc's) is rarely complete or up to date:
cells aren't always there, don't always have the advertised cell
name, or occasionally have stranger defects. New cells that
appear don't always come up right away, and test cells may not
always be up. Even when cells are up in a production sense, there are
still many cells (or networks?) that seem to have frequent down times at
off hours; if the site is located in Japan, those "off" hours may be during
the day in the States. And, with 177 hosts in 78 cells, even under ideal
conditions, there is still a fair chance that at least one host or network
link, somewhere, is down.
Bad as all these problems are with an "ls -F", they become potentially
much worse with a graphical interface such as the NeXT's or the Macintosh's.
At the University of Michigan, we have a three-part process to
try to keep CellServDBs as up to date and error-free as possible:
1. an automatic daemon that runs three times a day and
attempts to check real-world status. It starts with a list
of CellServDBs, which it merges together, and it then
runs a series of tests on each database server to determine
if it's alive, which cell it belongs to, and what DB servers
it knows about. The result is a snapshot in time of the
most complete list we can build of what was alive at that point.
This has evolved somewhat over time; the current version
is, I think, pretty fiendishly clever about its business.
(A rough sketch of the merge-and-probe idea appears after
this list.)
2. a daemon, not yet totally automated, that builds a real
CellServDB based on the results of a week's worth of runs
of part 1 (thus ensuring a temporary network failure won't
show up). A script is also generated to update root.cell.
These are both checked by hand (mostly for paranoia's sake),
and then run, to update a master local copy of CellServDB and
the cell's root.afs. (A sketch of the consolidation also
follows the list.)
3. a cron script, run on individual workstations, that
copies over the master copy of CellServDB to the local
disk and does an "fs newcell" on the resulting cell entries
(again, see the sketch after this list).
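
Roughly, the merge-and-probe pass in step 1 might look like the
Python sketch below. The CellServDB parsing follows the standard
format (">cell  #comment" header lines, then "IP  #hostname" server
lines); the probe_server() check is only a stand-in, since the real
daemon asks each database server, at the AFS level, which cell it
serves and which DB servers it knows about.

    #!/usr/bin/env python3
    # Sketch of the step-1 "merge and probe" pass; probe_server() is a
    # placeholder for the real AFS-level checks.
    import socket
    import sys

    def parse_cellservdb(path):
        """Return {cell: [(ip, hostname), ...]} from one CellServDB file."""
        cells, current = {}, None
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):       # ">umich.edu  #University of Michigan"
                    current = line[1:].split("#")[0].strip()
                    cells.setdefault(current, [])
                elif current:
                    ip = line.split("#")[0].strip()
                    host = line.split("#", 1)[1].strip() if "#" in line else ""
                    cells[current].append((ip, host))
        return cells

    def merge(dbs):
        """Union several parsed CellServDBs, deduplicating server entries."""
        merged = {}
        for db in dbs:
            for cell, servers in db.items():
                seen = merged.setdefault(cell, [])
                for entry in servers:
                    if entry not in seen:
                        seen.append(entry)
        return merged

    def probe_server(ip):
        """Placeholder liveness test: does the address at least resolve?"""
        try:
            socket.gethostbyaddr(ip)
            return True
        except OSError:
            return False

    if __name__ == "__main__":
        merged = merge(parse_cellservdb(p) for p in sys.argv[1:])
        for cell, servers in sorted(merged.items()):
            alive = [ip for ip, _ in servers if probe_server(ip)]
            print("%-28s %d/%d servers answered" % (cell, len(alive), len(servers)))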
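
Consolidating a week's worth of those snapshots (step 2) is then
mostly a union over the per-run results; a cell stays in the
generated CellServDB if it answered at least once during the week.
A sketch, leaving out the generation of the root.cell update script:

    def consolidate(snapshots):
        """snapshots: iterable of {cell: [(ip, hostname), ...]} dicts,
        one per run of the step-1 daemon over the week."""
        result = {}
        for snap in snapshots:
            for cell, servers in snap.items():
                known = result.setdefault(cell, [])
                for entry in servers:
                    if entry not in known:
                        known.append(entry)
        return result

    def write_cellservdb(cells, out):
        """Emit the consolidated list in CellServDB format."""
        for cell in sorted(cells):
            out.write(">%s\n" % cell)
            for ip, host in cells[cell]:
                out.write("%-20s#%s\n" % (ip, host))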
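
The per-workstation piece in step 3 amounts to little more than a
copy plus a loop over "fs newcell". Roughly (the master path here is
illustrative; /usr/vice/etc/CellServDB is the usual client location):

    #!/usr/bin/env python3
    # Sketch of the step-3 workstation cron job: install the master
    # CellServDB and tell the running cache manager about each cell.
    import shutil
    import subprocess

    MASTER = "/path/to/master/CellServDB"   # wherever the master copy lives
    LOCAL = "/usr/vice/etc/CellServDB"      # usual AFS client location

    def read_cells(path):
        """Return {cell: [server, ...]} from a CellServDB file."""
        cells, current = {}, None
        for line in open(path):
            line = line.strip()
            if line.startswith(">"):
                current = line[1:].split("#")[0].strip()
                cells[current] = []
            elif line and current:
                cells[current].append(line.split("#")[0].strip())
        return cells

    if __name__ == "__main__":
        shutil.copyfile(MASTER, LOCAL)
        for cell, servers in read_cells(LOCAL).items():
            # "fs newcell <cell> <server>..." updates the cache manager's
            # cell list on the fly, without a reboot.
            subprocess.run(["fs", "newcell", cell] + servers, check=False)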
In addition, specifically to deal with the NeXT, we keep a second,
shadow copy of root.afs just for it, with special magic symbolic
links so the actual directory doesn't get referenced until the user
references it.
The real solution is certainly some sort of distributed database.
Obviously, something like the domain name service would be
a much better way to name cells and locate servers than
a flat file. It looks like DFS (a.k.a. AFS 4) should fix this,
so I guess we just have to wait for that.
-Marcus Watts
UM ITD RS Umich Systems Group