We have a 5.2 system we're using as a storage server/filer running nfsd. We have hundreds of nodes that can hit it at one time; these clients are configured with autofs rather than permanent mounts (a legacy from the early days).
We use NFSv3 over TCP. Originally we configured the system with 100 daemons. Very quickly we started having jobs fail on the clients, and the server log filled with messages like:

  kernel: lockd: too many open TCP sockets, consider increasing the number of nfsd threads

So we bumped it to 300, rebooting because we had a new kernel to run anyway. That worked great for a few days, then the failures started again. I bumped it up to 500 daemons and tried to restart nfsd. nfsd refused to start, saying the port was busy, and I couldn't find anything that I'd expect to be using that port. I finally rebooted. No NFS. In the message log we now had:

  kernel: nfsd: Could not allocate memory read-ahead cache.
  nfsd[6413]: nfssvc: Cannot allocate memory

[We have 8GB of RAM on the system, and at boot time with 300 nfsd threads we don't even come close to using 8GB.] I backed down to 300 and had to reboot again, as nfs would not start. It came up fine, but we still see those pesky failures.

It gets more interesting. Or bizarre.

  % cat /proc/net/rpc/nfsd
  rc 13537 33496396 192754161
  fh 28 0 0 0 0
  io 3943998555 1199297042
  th 300 0 1188.353 239.850 65.863 16.361 1.857 0.000 0.000 0.000 0.000 0.000
  ra 600 1328847 18752 16893 13305 9929 6954 5154 4301 3170 2710 0
  net 226265416 0 226264783 70942
  rpc 226260856 0 0 0 0
  proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  proc3 22 4477 96527885 5693324 38850837 37663631 12004 3694379 10771245 7160052 1719510 42932 0 3152360 971863 3965505 33110 159 4197685 14857 4550 0 8837637
  proc4 2 0 0
  proc4ops 40 2284365 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

As I understand things, the "th" line here says that we have never come close to using all the nfsd daemons at one time! So, we have two (possible) problems.

1) Are the stats wrong, or is the problem not really in the number of threads?
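For reference, here's a quick sketch of how I'm reading that "th" line. The field meanings are my own assumption (thread count, times all threads were busy at once, then ten buckets of seconds spent at 0-10%, 10-20%, ... 90-100% of threads busy), so correct me if I've got the format wrong:

```python
# My reading of the "th" line from /proc/net/rpc/nfsd (field meanings
# are an assumption on my part, not from any official doc I have handy).
th = "th 300 0 1188.353 239.850 65.863 16.361 1.857 0.000 0.000 0.000 0.000 0.000"
fields = th.split()

threads = int(fields[1])          # configured nfsd threads
all_busy_count = int(fields[2])   # times every thread was busy at once
busy_seconds = [float(f) for f in fields[3:]]  # ten utilization buckets

print("threads:", threads)
print("all threads busy simultaneously:", all_busy_count, "times")
print("seconds spent 90-100% busy:", busy_seconds[-1])
```

If that reading is right, the all-busy counter is 0 and the top utilization bucket is 0.000 seconds, which is why I say we never come close to exhausting the threads.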
This is a fast, dual quad-core SuperMicro server, so I'm not worried about whether it can handle the load; we have much slower systems handling 100 threads without a hiccup (and the nature of the projects means this newer system will get a lot more traffic). The NIC doesn't seem to be swamped. Is there a kernel param I need to tweak for more open sockets or something?

2) If I do need more daemons, how do I determine how much memory I need? What is the limit on the number of daemons?

Thanks,
Miles
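P.S. On question 2, here's the back-of-the-envelope arithmetic I tried. The 2-entries-per-thread ratio is only inferred from the "ra 600" line above (600 = 2 x 300), and the per-entry size is a pure guess on my part, so treat this as a sketch:

```python
# Guesswork: the "ra 600" line with 300 threads suggests the read-ahead
# cache holds 2 entries per nfsd thread.  ENTRY_BYTES is a made-up
# assumption, not a measured value.
ENTRY_BYTES = 128  # hypothetical per-entry size

for threads in (100, 300, 500):
    entries = 2 * threads
    print(threads, "threads ->", entries, "entries,", entries * ENTRY_BYTES, "bytes")
```

If that guess is anywhere near right, even 500 daemons would need well under a megabyte for the read-ahead cache, which makes me suspect the allocation failure is about getting a single contiguous chunk (fragmentation) rather than any shortage of our 8GB of RAM. But I'd love to hear from someone who knows the actual allocation.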
