To simplify software and hardware management, I'm trying to run a set of PXE-booted Client machines as web/db or mail servers.

The NFS/DHCP/YP servers are running on a 5.4-STABLE Server. I mostly followed the PXE guide when building these systems.

All of the disk storage (except for swap) sits on the master Server (which has a bunch of external drive sleds), and all of the Client machines boot via Gig-E.

Client machines are running 5.4-STABLE as well, but with a different kernel configuration from the master Server, since the hardware is slightly different. Client machines share userland with the Server.

At the moment I have one Client machine running about 40 domains of web and db, with reasonably low traffic (less than 3Mbit/sec total) and one Client machine booted from the master Server, but not doing anything.

Resource utilization on the master Server seems pretty low.

Sporadically, there appear to be stalls on some locks with rpc.lockd. These lock stalls exhibit "interesting" behavior on the Client machines: Slots will fill up on Apache in the "W" state. SSH login attempts to the Client machine (passwd files get some user data via YP) will hang and time out. When I find a file (via Apache's extended status) which appears to be one of the stalled locks, and I attempt to do anything with the file via a shell on the Client machine, such as "cat" it, that shell will become unresponsive. Any process which is stalled on one of these files cannot be killed.
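Since touching a stalled file from a shell costs the shell, one way to check a suspect file might be to do the open and lock attempt in a throwaway child process and give up on it after a timeout. The following is only a sketch under assumptions: the name probe_lock, the 5 second timeout, and the choice of a shared non-blocking lock are all illustrative, and it assumes a Python 3 interpreter is available on the Client. A stuck child may itself be unkillable, but the parent stays responsive.

```python
import subprocess
import sys

# Child program (illustrative): open the file, take a shared advisory lock
# without blocking, release it, and exit. Exit code 2 means the open or the
# lock request was refused with an OS error.
CHILD = """
import fcntl, sys
try:
    with open(sys.argv[1], "rb") as f:
        fcntl.lockf(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
        fcntl.lockf(f, fcntl.LOCK_UN)
except OSError:
    sys.exit(2)
"""


def probe_lock(path, timeout=5.0):
    """Return 'ok', 'refused', 'error', or 'stalled' for a lock attempt on path.

    The attempt runs in a separate process so that a hang inside the
    kernel or lock daemon does not take the caller down with it.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck on a stalled lock may ignore this.
        proc.kill()
        return "stalled"
    if code == 0:
        return "ok"
    if code == 2:
        return "refused"
    return "error"
```

Calling probe_lock() on a file named in Apache's extended status (the path is whatever shows up there) would then report "stalled" instead of hanging the shell, assuming the hang happens in the open or lock call.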

On the Server, the only symptom I've witnessed is that rpc.lockd starts using a bit more CPU than it usually does. Normal utilization is 0.0, and when the problem is happening, it might go up to 3.0 or so. "cat"ing a file on the Server which appears stalled on the Client works fine.
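Because the rpc.lockd CPU figure is the only Server-side symptom so far, logging it over time could help pin down when a stall begins. The sketch below is an assumption-laden illustration, not a tested monitoring tool: the function names, the 1.0 threshold, and the 30 second interval are made up for the example, and it assumes ps accepts "-ax -o pcpu= -o comm=" and reports the daemon's command name as rpc.lockd.

```python
import subprocess
import time


def lockd_cpu(ps_output, name="rpc.lockd"):
    """Sum the %CPU column for every process whose command name equals name.

    Expects lines of the form '<pcpu> <command>'; lines that do not start
    with a number (such as a header) are skipped.
    """
    total = 0.0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        try:
            pcpu = float(fields[0])
        except ValueError:
            continue
        if fields[1].strip() == name:
            total += pcpu
    return total


def sample():
    """Run ps once and return the current rpc.lockd CPU figure."""
    out = subprocess.run(
        ["ps", "-ax", "-o", "pcpu=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return lockd_cpu(out)


def watch(threshold=1.0, interval=30.0, samples=None):
    """Print a timestamped line whenever rpc.lockd CPU exceeds threshold.

    samples=None loops until interrupted; an integer stops after that many.
    """
    taken = 0
    while samples is None or taken < samples:
        cpu = sample()
        if cpu > threshold:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "rpc.lockd cpu", cpu)
        taken += 1
        time.sleep(interval)
```

Running watch() on the Server and comparing its timestamps against the times Apache slots pile up in "W" on the Client might show whether the CPU rise precedes or follows the stall.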

A stop and start of nfslocking on the Server seems to clear things up. Apache on the Client will recover on its own, I'm guessing after each stalled lock reaches a timeout. I usually restart Apache gracefully, which forces the recovery to happen faster.

As for timing, the problem doesn't appear to be periodic. It doesn't appear to be load related either - I suffered through a Digg of one of the sites, and while the Client machine served more traffic over those couple of days than it had in a month, this particular problem did not occur.

Over the past three months or so, this issue has probably cropped up three or four times.

What can I do to troubleshoot this? I would like to add more client machines, but I can't until this problem is resolved.

Changing OS builds at this point, unless absolutely necessary, is not something I want to do.

Thanks for any insight!

