> 'attach' time *is* a nuisance - in particular since (luckily!) this is
> something that we encounter much more frequently than fsck and salvage.  In
> order to restart a fileserver it first has to go down gracefully, something
> that can easily take 5 minutes, during which volumes are simply not
> accessible :-(.  Startup is even slower, typically 10-15 minutes on a bigger
> server, but at least users get 'waiting for busy volume' rather than the
> 'no such device' rubbish of the first case.

Most definitely!  We have three fileservers serving 80 GB of data,
consisting of approximately 40,000 volumes (user home directories).  A
clean restart takes half an hour to shut down and at least an hour to
attach.  When fsck and salvage are involved, add another two hours on
top of the whole works.  The fileservers are serving data from hot-swap
RAID 5 disks, so shutdowns are kept to an absolute minimum...
I'd love to see some way to improve the attach time; it would be nice
if more than one volume could be attached at a time.
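
By "more than one at a time" I'm picturing something like the sketch
below: a fixed pool of worker threads pulling volumes off a shared list
so the per-volume disk waits overlap.  None of this is real AFS code -
attach_volume() is just a stub standing in for the actual per-volume
work, and the counts are made up:

    /* Sketch only: attach volumes from a pool of worker threads
     * instead of strictly one after another. */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 8
    #define NVOLUMES 40000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_vol = 0;          /* next list index to hand out */
    static int volume_ids[NVOLUMES];  /* filled in by the real server */

    static int attach_volume(int vol_id)  /* stub, not real AFS code */
    {
        (void)vol_id;
        return 0;
    }

    static void *worker(void *arg)
    {
        int i;
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            i = next_vol++;            /* claim the next volume */
            pthread_mutex_unlock(&lock);
            if (i >= NVOLUMES)
                return NULL;           /* list exhausted, worker done */
            if (attach_volume(volume_ids[i]) != 0)
                fprintf(stderr, "attach of volume %d failed\n",
                        volume_ids[i]);
        }
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        int i;

        for (i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }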

> Having said this: the original question was about a 500 GB fileserver.  As 
> far as I understand, restart time is governed by the number of volumes to 
> attach - one could cut down on those by making each one 2 GB or bigger.
> We have seen occasions on which something like that does not sound 
> ridiculous.

When each volume is a user home directory, that becomes more
difficult.  It's next to impossible to maintain individual user quotas
without giving each user their own volume.
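
(For anyone unfamiliar with the setup: AFS quotas apply to whole
volumes - fs setquota sets the quota on the volume containing the given
path - so users lumped into one big volume all share a single quota.
The path and number below are made-up examples; -max is in 1 KB blocks,
so this is roughly 100 MB:)

    fs setquota -path /afs/<cell>/user/jdoe -max 100000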

Our other main problem is that with so many volumes on one server we
suffer from some pretty nasty performance problems at peak periods.
I've been working on tuning the servers and have found two things that
seem to improve performance.  The biggest win so far is to keep all of
the volume headers cached (the -vc argument to the fileserver).  The
default, I believe, is 600; in our case I have the value set at 12800.
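
If anyone wants to size -vc for their own servers, a rough volume count
is easy to get - vos listvol prints one line per volume plus a few
header and summary lines (the server name here is just a placeholder):

    vos listvol fs1.example.edu | wc -l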

The other parameter I've found useful is to raise the number of server
threads (the -p argument to the fileserver).  In our case the value is
set at 25, which is the maximum that can be specified without the
fileserver dumping core within a few hours.
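
For anyone who wants to try the same values: both flags go on the
fileserver command line of the fs instance, so creating it fresh would
look something like this (the server name is a placeholder, and I'm
assuming the standard /usr/afs/bin paths; on an existing server the
command line lives in BosConfig instead):

    bos create fs1.example.edu fs fs \
        -cmd "/usr/afs/bin/fileserver -vc 12800 -p 25" \
             "/usr/afs/bin/volserver" "/usr/afs/bin/salvager"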

And while I'm on the subject, during my performance monitoring I'm
noticing regular periods of sluggishness on all three of these
fileservers.  The periods occur every 16 minutes, at which time there's
a burst of disk I/O and a large drop in network traffic.  The pauses
on any one server don't necessarily coincide with the pauses on the
other two, so I suspect the problem is local to the fileserver.  All
three servers are DEC Alphas running DU 4.0 and the latest fileserver
binaries.  The machines do nothing but AFS fileservice and I've
commented everything out of cron.  Is there something in the
fileserver itself that runs every 16 minutes?
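
(For anyone who wants to look for the same pattern on their own
servers, a timestamped iostat log is enough to see the bursts.  This is
Bourne shell, with iostat arguments as on our DU 4.0 systems; other
platforms may want different arguments:)

    while true
    do
        date                      # timestamp each sample
        iostat 1 5                # five one-second disk samples
        sleep 55                  # roughly one sample per minute
    done >> /var/tmp/fsio.log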

Kevin Hildebrand
University of Maryland, College Park
