We have a 4-processor SGI Onyx that has been unstable for quite a while.
SGI told us to go to IRIX 5.3 because we were running into graphics bugs in
IRIX 5.2 that they said were fixed in IRIX 5.3. We begged Transarc for a
pre-alpha and/or alpha version of AFS, which they were kind enough to provide.
On single processor machines, even this very early release of code was quite
reliable (we had Indigos up for 28+ days) but on the multiprocessor Onyx
things weren't as stable. This weekend I discovered that IRIX provides the
capability for the kernel to treat device drivers as "unsemaphored for
multiprocessing." By using the single-processor (SP) libraries together with
the following changes, I have seen a dramatic improvement in the stability of
our multiprocessor SGI Onyx. I offer these changes to the list in case someone
else out there is in a similar situation:

- Single-processor kernel libraries installed from the alpha release of
  AFS 3.4 from Transarc. (/usr/vice and /usr/afsws updated appropriately
  as well.)
- /var/sysgen/master.d/afs non-semaphored flag set (with appropriate
  symlinking so that old versions are still around). See
  /var/sysgen/master.d/README for the appropriate flags.
- /etc/config/afsd.options updated with -daemons 1.
- /etc/init.d/afs changed so that afsd is started under runon 0.
- /var/sysgen/master.d/stune and /var/sysgen/master.d/mtune/kernel
  updated so that max_netprocs = 1. (This undocumented tuning
  variable causes the kernel to single-thread networking, which is
  important so that incoming AFS network packet processing doesn't get
  scheduled on multiple processors simultaneously.)
- /etc/init.d/afs set up to delete everything from /usr/vice/cache that
  contains data. (This is to keep crash-corrupted cache entries from
  persisting past crashes.)
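
For anyone who wants to try the two /etc/init.d/afs changes, here is a rough sketch of what they might look like. It is an illustration only, not the actual script: the function names are made up, the afsd path /usr/vice/etc/afsd is assumed, and "contains data" is read as "any non-empty regular file".

```sh
#!/bin/sh
# Sketch of the two /etc/init.d/afs changes described above.
# Assumptions (not from the original post): the function names, the AFSD
# path, and reading "contains data" as "non-empty regular file".

CACHE_DIR=${CACHE_DIR:-/usr/vice/cache}
AFSD=${AFSD:-/usr/vice/etc/afsd}

# Remove every non-empty regular file under the cache directory so that
# entries damaged by a crash are not reused after the next boot.
purge_afs_cache() {
    dir=${1:-$CACHE_DIR}
    [ -d "$dir" ] || return 0
    find "$dir" -type f -size +0 -exec rm -f {} \;
}

# Print the command line that would start afsd pinned to processor 0.
# runon is the IRIX command that runs a program on a given CPU; any
# arguments (for example "-daemons 1") are passed through to afsd.
afsd_command() {
    echo "runon 0 $AFSD $*"
}
```

In the start case of the init script, one would call purge_afs_cache first and then run the command printed by afsd_command with the options from /etc/config/afsd.options (for example, afsd_command -daemons 1).
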
Since making these changes I have (in the spirit of regression testing):
- Run 4-way parallel makes (gnumake --jobs 4)
- Run my ~nickless/crashsgi.c program
- Done other stress testing
...all without AFS-related system crashes or hangs. There is still
an AFS bug related to the way emacs preloads data and then dumps
itself as part of the compile/build process, but this seems to be a limited
case.
Unfortunately, in the process of making these changes I discovered two other
non-AFS-related stability bugs in our system. I'm reporting those to the
appropriate vendors now as well. Oh well, life goes on.
--
Bill Nickless [EMAIL PROTECTED] +1 708 252 7390
PGP 2.6.2 Key fingerprint = 0E 0F 16 80 C5 B1 69 52 E1 44 1A A5 0E 1B 74 F7
http://www.mcs.anl.gov/people/nickless