Oops, Lyle is indeed right - the file server won't do "fsync"s when
storing data unless you ask it to (and this is new as of AFS 3.3;
previous releases didn't do that.)  However, there is one
interesting case when the file server is guaranteed to do an
"fsync" - if you write to a file that is "shared" with another volume
(either as the result of "vos backup" or if replicated, with "vos
release"), the file server has to make a copy of the file the first
time you write to it.  That copy always does get an "fsync", and since
it happens only once, at the end of the file copy, that fsync can be
quite slow if the file is of any great size.  So,
depending on your cell, you might see an interesting performance
hit if you make small changes to many large files once a day.
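
To make the cost concrete, here's a rough C sketch of what that
copy-on-write step amounts to.  (The real file server works on inodes
on the vice partition, not on pathnames, and the function name here is
made up; this just shows the shape of the I/O.)

    /* Break a clone: copy the whole file, then fsync exactly once.
     * That one fsync has to flush every dirty block of the copy, so
     * for a big file it can stall the call for a long while. */
    #include <unistd.h>

    static int
    break_clone(int src, int dst, char *buf, size_t bufsz)
    {
        ssize_t n;

        while ((n = read(src, buf, bufsz)) > 0)
            if (write(dst, buf, n) != n)
                return -1;              /* write error */
        if (n < 0)
            return -1;                  /* read error */
        return fsync(dst);              /* single fsync at the very end */
    }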

However, it's still the case that the file server does do synchronous
read/writes and while it might seem intuitively obvious that
lots of I/O paths will increase speed, I think actual results
will be pretty disappointing.  Since most clients will have
a chunk size of 64K, that's how big most logical I/O will be from
the AFS file server process.  The actual file system probably has much
smaller blocks -- perhaps 4K -- so each logical read will require many
successive block reads.  Most Unix implementations still follow
the well-known Unix V7 implementation - they read one block
ahead.  (AIX does it differently but the result is the same.)  For the
AFS file server, this will result in a very high percentage of buffer
read-aheads that "succeed" - but that is deceptive; the only real
win is that the kernel is able to start reading the next block while it
copies the current block out of the kernel.  The AFS file server is
still in fact completely I/O bound, and it is not able to make effective
use of multiple I/O paths.
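
To put rough numbers on it (sizes as above, just to show the shape):

    64K logical read / 4K file system block  =  16 physical block reads
    one-block read-ahead hides at most 1 of those reads (the next
    block is fetched while the previous one is copied out), so
    ~15 of the 16 disk waits remain strictly sequential.

That's why extra spindles or controllers don't buy much here: the
server never has enough I/O outstanding to keep more than one busy.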

On the umich.edu file servers, last time I looked, we were
managing about 1% CPU utilization, and were about 10% I/O bound.

So far as volumes on DB servers go - whenever you reference
a new AFS file, there is a very intricate dance that goes on between
your machine's cache manager, the DB servers, and the file server.
First, if your machine (CM) doesn't know where the volume is, it has to
contact a VL server (DB-VL).  Then CM can connect to the file server (FS),
if it doesn't already have a connection on your behalf.  If this is a
new connection to FS it needs to talk to the protection database (DB-PT) to
map your kerberos name into a vice ID, plus a 2nd call to find out what
protection groups your vice ID is a member of (DB-PT GetCPS).  Now that the
connection is set up, FS and CM can start slinging chunks back & forth.
As you can see, many RPCs to the DB servers could be made just to satisfy one
request on one client machine - and so it is very definitely the case
that DB server performance will make a real overall difference.
That difference will be most perceptible to users when first logging in.
(UM users noticed a real improvement when the DB servers were
upgraded from memory-bound RS/6K 320H's to fairly hefty 580's.)
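
In outline, the very first reference to a file in a volume the CM has
never heard of looks something like this (RPC names are approximate and
from memory, and steps get skipped when things are already cached):

    CM -> DB-VL : VL_GetEntryByName(volume)    where does the volume live?
    CM -> FS    : new rx connection            per-user, authenticated
    FS -> DB-PT : PR_NameToID(krb name)        kerberos name -> vice ID
    FS -> DB-PT : PR_GetCPS(vice ID)           groups the vice ID is in
    CM <-> FS   : FetchStatus / FetchData      start slinging chunks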

If your DB servers are also acting as FS servers, FS service
could compete with DB service either for disk or (for backups) for CPU.
This competition will naturally be keenest on your sync site,
especially so if you do many write modifications to your database,
and if your databases are of any size.  That means the DB servers of
a large site will certainly get quite a workout from backups.
Multiple I/O paths and striping could help with the DB servers;
unfortunately I don't think there's any support for splitting up the
DB files from ka, pt & vl (& bu) to different partitions or controllers.
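
(All of those databases live together as flat files in one directory on
the DB server - roughly this, names from memory:

    /usr/afs/db/kaserver.DB0    authentication (ka)
    /usr/afs/db/prdb.DB0        protection (pt)
    /usr/afs/db/vldb.DB0        volume location (vl)
    /usr/afs/db/bdb.DB0         backup (bu)

so there's no knob for spreading them over different disks.)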

If you need to run much file service on your DB servers, it would
be an advantage to put data there that is infrequently referenced and
even less frequently changed, as well as data that doesn't need
to be backed up as regularly.  This is most true of the DB server with
the numerically lowest IP address, since ubik makes that machine the
sync site (& I think transarc already recommends putting your
speediest machine there, if there's a difference in speed.)  Multiple
I/O paths will also help - if your file I/O is on a different path from
your DB activity that could make file service on the same machine
almost painless.  Here at the U, we mainly put small "critical"
replicated volumes on the DB servers, although that's more for reliability
than for speed.

Peter Honeyman is correct in wishing that umich.edu's DB servers
were better separated.  Unfortunately, for umich.edu, there would
be no advantage, at least for now.  The reason to separate things
out would be for greater reliability through redundancy.  However,
the um backbone has essentially no redundancy whatsoever.  When
it dies, you can't get nowhere nohow noway.  Therefore, the
only possible consequence of moving the DB servers to separate
subnets would be increased unreliability - the unreliability
of the backbone plus the unreliability of whatever subnets
the DB servers were on.  So, before working to separate the
DB servers, the first thing to work on is some redundancy
in the rest of the infrastructure, so that separating the DB
servers would in fact result in significantly increased reliability.
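
To put rough numbers on that (availabilities invented for illustration):
if the backbone is up 99% of the time and a given subnet is up 99% of
the time, a DB server you can only reach through both is up about

    0.99 * 0.99  =  0.9801,  i.e. ~98%

which is worse than either piece alone.  Only redundant paths turn that
multiplication back into a win.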

In separating DB servers, there is one thing to watch for,
and that's the speed of the network path in between.  If ubik gets
out of sync, it's quite fond of shipping over a complete copy of the
DB, and if the databases are of any size, that can take
a substantial amount of time.  It's best if the network path in between
is as reliable as possible (to minimize outages) and has the highest
possible transfer rates.  Even when nothing has gone wrong, any changes
that are made in the database have to be propagated to the other sites,
so lengthy delays between DB servers will definitely impact the time to
change a database.
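
For a sense of scale (sizes and link speeds invented): shipping a 40 MB
protection database over a 10 Mbit/s ethernet is at best

    40 MB * 8 bits/byte / 10 Mbit/s  =  ~32 seconds

of pure transfer time; over a 1.5 Mbit/s T1 the same copy is more like
3.5 minutes, and while it's going on that DB site isn't much use to
anybody.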

                                -Marcus Watts
                                UM ITD RS Umich Systems Group
