We have a 4-processor SGI Onyx that has been unstable for quite a while.  
SGI told us to go to IRIX 5.3 because we were running into graphics bugs in 
IRIX 5.2 that they said were fixed in IRIX 5.3.  We begged Transarc for a 
pre-alpha and/or alpha version of AFS, which they were kind enough to provide.

On single processor machines, even this very early release of code was quite 
reliable (we had Indigos up for 28+ days) but on the multiprocessor Onyx 
things weren't as stable.  This weekend I discovered that IRIX lets the 
kernel treat device drivers as "unsemaphored for multiprocessing."  By using 
the single-processor (SP) AFS libraries with the following changes, I have 
seen a dramatic improvement in the stability of our multiprocessor SGI Onyx.  
I offer these changes to the list in case someone else out there is in a 
similar situation:

 - Single-processor kernel libraries installed from the alpha release of 
   AFS 3.4 from Transarc.  (/usr/vice and /usr/afsws updated appropriately
   as well)

 - /var/sysgen/master.d/afs non-semaphored flag set (with appropriate
   symlinking so that old versions are still around.)  See
   /var/sysgen/master.d/README for the appropriate flags.

 - /etc/config/afsd.options updated with -daemons 1.

 - /etc/init.d/afs changed so that afsd is started under runon 0 (that is,
   pinned to processor 0).

 - /var/sysgen/master.d/stune and /var/sysgen/master.d/mtune/kernel
   updated so that max_netprocs = 1  (This undocumented tuning 
   variable causes the kernel to single-thread networking--important 
   so that incoming AFS network packet processing doesn't get 
   scheduled on multiple processors simultaneously.)

 - /etc/init.d/afs set up to delete everything from /usr/vice/cache that
   contains data  (This is to keep crash-corrupted cache entries from
   persisting past crashes.)
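
For anyone who wants to try the cache-cleaning step, here is a minimal
sketch of the kind of thing that could go in /etc/init.d/afs before afsd
starts.  It is not the code from our script: the function name
purge_nonempty_cache_files is made up for illustration, and I am assuming
"contains data" simply means a regular file with non-zero size.

```shell
# Hypothetical helper (name is illustrative, not part of any AFS release).
# Removes every regular file under the given directory that has non-zero
# size, leaving empty files and the directory structure in place.
purge_nonempty_cache_files() {
    cache_dir="$1"

    # Refuse to run if the argument is missing or is not a directory.
    [ -n "$cache_dir" ] && [ -d "$cache_dir" ] || return 1

    # -size +0 matches files occupying more than zero 512-byte blocks,
    # i.e. any file holding at least one byte.
    find "$cache_dir" -type f -size +0 -exec rm -f {} \;
}
```

The init script would call it as purge_nonempty_cache_files /usr/vice/cache
ahead of the line that starts afsd.  Note that this also removes the cache
index files, so the cache manager starts from scratch on each boot, which is
the point here.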

Since making these changes I have (in the spirit of regression testing):

 - Run 4-way parallel makes (gnumake --jobs 4)

 - Run my ~nickless/crashsgi.c program

 - Done other stress testing

...all without AFS-related system crashes or hangs.  There is still
an AFS bug related to the way emacs preloads data and then coredumps
itself as part of the compile/build process, but this seems to be a limited 
case.

Unfortunately, in the process of making these changes I discovered two other
non-AFS-related stability bugs in our system.  I'm reporting those to the
appropriate vendors now as well.  Oh well, life goes on.
--
Bill Nickless              [EMAIL PROTECTED]               +1 708 252 7390
PGP 2.6.2 Key fingerprint =  0E 0F 16 80 C5 B1 69 52  E1 44 1A A5 0E 1B 74 F7
                 http://www.mcs.anl.gov/people/nickless

