> I'm whistling in the dark, here, but on the off chance:
> We've been seeing machine hangs every once in a while (twice in the last
> month on a very heavily used machine) on our DEC 5000/200 (Ultrix 4.4, AFS 3.3
> client) - the machine locks up and prints "cant get mbufs" to the console
> until it's rebooted. It could be _lots_ of things, obviously (since we
> can't do anything once it's hung, though, it's hard to tell)... Has anyone
> else seen this sort of behavior?
> Thanks
> Pat Wilson
We had major problems here getting AFS to run on our Ultrix 4.4
machines. Sometimes the machines would hang with the message you
mentioned above, and sometimes they'd just hang with no messages at
all. After many mail messages back and forth to Transarc, we found
out two things that needed to be done.
1) Rebuild the kernel with the following parameters adjusted:
   1. Increase physmem to be 2 times physical memory
   2. Increase maxusers to be 2-4 times the current value
2) There is an Ultrix 4.4 patch that solved most of the rest of the
   hangs; it is also related to problems with kernel memory allocation.
I include below the notes I got from Transarc about the kernel memory
problem. They talk about Ultrix 4.3, but they are applicable to 4.4 as
well. I also include the description and patch number for 4.4.
These changes have fixed 99% of our hang problems, and I suspect the
remaining problems we have are not related to AFS...
Good luck!
Kevin Hildebrand
University of Maryland, College Park
Project Glue Systems Developer
-------------------------------------------------------------------------------
A number of AFS sites running on Ultrix machines have encountered
problems where their machines hang while running AFS. This has
happened with different versions of Ultrix and AFS, but the problem is
exacerbated with newer versions of both (Ultrix 4.3a and AFS 3.3).
The symptoms include hangs with errors such as "No buffer space
available" and "can't get mbufs". The cause is running out of kernel
memory, even in cases where there appears to be sufficient memory on
the system.
We have encountered two reasons for the hangs seen by AFS users:
1) There is an Ultrix problem related to the SCSI bus that results in
the system hanging when it runs out of kernel memory. There are
patches available from Digital to sim*.o files to fix this problem.
In order to obtain these patches, the customer must report the problem
to Digital. Digital would like to verify that the customer is seeing
this problem before delivering the patches to them. To verify the
problem yourself, look at the crash file with the crash utility and
run "scsi -all" (or "cam -all" if you are running cam); the output
will say "target has been reset". The patches only work if you have
cam installed.
Several AFS sites are running these patches, which seem to help at
least to some extent.
2) AFS allocates kernel memory for internal data structures, and at
several AFS sites, AFS has tried to allocate more memory than is
available, which results in a system hang. AFS allocates memory from
a particular memory pool in the kernel. The size of that memory pool
is calculated by Ultrix based on the system parameter "physmem".
Therefore, the solution in these situations is to increase the size of
physmem, even to values much larger than the actual memory on the
machine. In our testing at Transarc, we found that setting this value
to 4 times (or more) the size of physical memory provided a reasonable
amount of kernel memory for AFS data structures, and alleviated system
hangs.
Aside from this memory pool, AFS also uses some other kernel tables.
Therefore, increasing the system parameter "maxusers" could also help
in systems that have a large number of users logging into the AFS
system. Another system parameter that is useful to adjust on AFS
machines is "ifqmaxlen". This affects the amount of memory that is
allocated for memory buffers. The suggestion is to increase this
parameter to 1024 or 2048.
There are some interactions between these parameters, and some
possible side effects that we don't know about. And because of
variations in each machine's configuration and use, some amount of
trial and error regarding these settings might be required.
For a number of sites, setting the following has fixed their problems:
1. Increase physmem to be 2 times physical memory
2. Increase maxusers to be 2-4 times the current value
3. Increase ifqmaxlen to be at least 1024, possibly 2048
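
As a rough illustration of the arithmetic in the list above, here is a
minimal Python sketch that turns a machine's current values into
candidate settings. The function name, the dictionary keys, and the
choice of megabytes as the unit are assumptions made for this example;
they are not part of Ultrix or AFS, and the results still have to be
entered into the kernel configuration by hand before rebuilding.

```python
def suggested_settings(physical_mem, current_maxusers, physmem_factor=2):
    """Return candidate kernel settings based on the Transarc notes.

    physical_mem     -- actual physical memory, in whatever unit the
                        physmem parameter uses (assumed here: megabytes)
    current_maxusers -- the maxusers value the kernel is built with now
    physmem_factor   -- 2 is the value that worked for a number of sites;
                        Transarc's own testing used 4 or more
    """
    if physical_mem <= 0 or current_maxusers <= 0:
        raise ValueError("inputs must be positive")
    if physmem_factor < 1:
        raise ValueError("physmem_factor must be at least 1")
    return {
        # physmem is deliberately set larger than the real memory size
        "physmem": physmem_factor * physical_mem,
        # maxusers: 2-4 times the current value, as (low, high)
        "maxusers_range": (2 * current_maxusers, 4 * current_maxusers),
        # ifqmaxlen: at least 1024, possibly 2048
        "ifqmaxlen_choices": (1024, 2048),
    }


if __name__ == "__main__":
    # Hypothetical machine: 64 MB of memory, maxusers currently 32
    for name, value in suggested_settings(64, 32).items():
        print(name, value)
```

These are starting points only; as noted above, the parameters interact
and some trial and error may be needed on each machine.
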
We also have had some users encounter problems running AFS 3.3 servers
and clients on Ultrix 4.3a machines, and these values haven't
completely solved the problems. We are still trying to determine the
best combination of values, as well as investigate other changes, to
work in these situations.
If you are seeing hangs on your Ultrix system that appear to be due to
lack of kernel memory, make the changes described above. If these
changes do not solve the problem, contact your Transarc Customer
Support Specialist. We will work with you to find the right
combination of parameters to allow you to run AFS successfully. We
will consult with Digital as necessary to ensure that the suggestions
we make are reasonable.
Liz Hines
Director of Product Support
Transarc Corporation
------------------------------------------------------------------------------
Ultrix Patch 7BXB35390
RISC systems with a small amount of physical memory hang when
many processes are running. Usually, a message like 'cannot
get mbufs' is observed.
/usr/sys/config/mips/param.c contains a configurable variable
guardpages which defaults to GUARDPAGES. Setting guardpages
to zero allows more wired and unwired kernel memory to be used.
This is useful for RISC systems with a small amount of physical
memory where wired/unwired kernel memory is in great demand.
Usually, the configurable parameter physmem is adjusted in
this situation.
Installation:
Kernel rebuild required.