To simplify software and hardware management, I'm
attempting to run a set of PXE-booted Client machines as web/db or
The NFS/DHCP/YP servers are running on a 5.4-STABLE Server. I mostly
followed the PXE guide when building these systems.
All of the disk (except for swap) sits on the master Server (which
has a bunch of external drive sleds), and all of the Client machines
boot via Gig-E.
Client machines are running 5.4-STABLE as well, but with a different
kernel configuration from the master Server, as the hardware is
slightly different. Client machines share userland with the Server.
At the moment I have one Client machine running about 40 domains of
web and db, with reasonably low traffic (less than 3Mbit/sec total)
and one Client machine booted from the master Server, but not doing
Resource utilization on the master Server seems pretty low.
Sporadically, there appear to be stalls on some locks with rpc.lockd.
These lock stalls exhibit "interesting" behavior on the Client
machines: Slots will fill up on Apache in the "W" state. SSH login
attempts to the client machine (passwd files get some user data via
YP) will hang and time out. When I find a file (via Apache's extended
status) which appears to be one of the stalled locks, and I attempt
to do anything with the file via a shell on the client machine, such
as "cat" it, that shell will become unresponsive. Any process which
is stalled on one of these files cannot be killed.
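A rough sketch of how a suspect file could be probed without tying up
an interactive shell: the lock attempt runs in a throwaway child
process, and the parent only waits for it up to a timeout. The name
probe_lock, the 5-second default and the result strings are
illustrative assumptions, not part of the setup described here, and
the sketch assumes a Python 3 interpreter is available on the client.
It uses a POSIX fcntl-style lock, which may or may not be the kind of
lock Apache is stalling on.

```python
import subprocess
import sys

# Code run in the child: take and release a shared POSIX advisory lock.
# On an NFS mount this lock request normally goes through the lock daemon.
_CHILD = (
    "import fcntl, sys\n"
    "with open(sys.argv[1], 'rb') as fh:\n"
    "    fcntl.lockf(fh, fcntl.LOCK_SH)\n"
    "    fcntl.lockf(fh, fcntl.LOCK_UN)\n"
)

def probe_lock(path, timeout=5.0):
    """Return 'ok', 'error' or 'stalled' for a lock attempt on path.

    Hypothetical helper: the child does the blocking work, so a stalled
    lock leaves the child hanging instead of the caller.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Deliberately not killed or waited on: a child stuck on a
        # stalled lock may be unkillable, and waiting would hang us too.
        return "stalled"
    return "ok" if code == 0 else "error"

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, probe_lock(name))
```

Run with one or more file names as arguments, it prints one result per
file. A "stalled" result leaves the child process behind, so it is best
used sparingly.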
On the server, the only symptom I've witnessed is that rpc.lockd
starts using a bit more CPU than it usually does. Normal utilization
is 0.0, and when the problem is happening, it might go up to 3.0 or
so. "cat"ing a file on the Server which appears stalled on the
Client works fine.
A stop and start of nfslocking on the server seems to clear things
up. Apache on the client will recover on its own, I'm guessing after
each stalled lock reaches a timeout. I usually gracefully restart
Apache, which forces the recovery to happen faster.
As for timing, it doesn't appear to be consistently periodic. It
doesn't appear to be load related - I suffered through a Digg of one
of the sites, and while the client machine served more bandwidth that
couple of days than it had in a month, this particular problem did
not occur.
Over the past three months or so, this issue has probably cropped up
three or four times.
What can I do to troubleshoot this? I would like to add more client
machines, but I can't until this problem is resolved.
Changing OS builds at this point, unless absolutely necessary, is not
something I want to do.
Thanks for any insight!