Hi, I've been having problems with the NFS mount on one of my loadbalancers and I can't seem to find a good solution, so I thought I might ask here for any (debugging) tips.
Currently I mount our tools & scripts repository over NFS on all our servers, using the following mount options:

    timeo=16,intr,lock,rsize=16384,wsize=16384,tcp

Originally these client-side mount options (apart from the tcp one) were not present, which caused a lot of problems due to hanging NFS shares and uninterruptible processes; hence the intr/timeo values. By a hanging NFS mount I mean an unresponsive mounted filesystem that hangs any process trying to access it, including bash tab-completion and the like.

Changing these NFS options has alleviated the problems across the cluster immensely: where we used to have at least 2-5 servers hanging in limbo every day, we are now down to maybe one per week. With one exception though :-( Our loadbalancer has trouble with this NFS mount roughly every two or three days. The mount simply hangs. Because of the intr option I can at least kill all processes using the filesystem and umount/mount the share; after that procedure it functions normally again.

I'm fairly certain it isn't an NFS-server problem, as I have the same filesystem mounted on approx. 80 servers and on those machines the filesystem stays accessible. Testing access to the NFS server confirms this: I can mount the same share on a different mount point without any problems. The relevant entry in /etc/exports on the NFS server is:

    /mnt/raid2/tools/ 10.10.0.0/24(rw,async,no_root_squash)

I can't find anything relevant in the logfiles when this hang occurs, so tips on specific things to look for would be appreciated.

One of the reasons this problem is highly annoying is that I run a script from the tools share to sync the load-balancer config between my primary and secondary loadbalancer. The moment the share hangs, this script obviously doesn't work either, so I run the risk of having my load-balancers out of sync, which I would prefer to avoid. Maybe this script is causing the problem? It's run every 5 minutes from cron. Or maybe the combination of async with frequent access is causing the problem?

Googling turns up lots of info on NFS. I've read through a bunch of howtos and performance advice. Most of it is pretty outdated and refers to 2.2 Linux kernels :-( Some of it had good advice, but none of it has solved the problem so far.

All servers run Gentoo Linux, 2.6.x kernels and a recent install of the mount binaries:

* equery belongs $(which mount)   => sys-apps/util-linux-2.12r on the client
* equery belongs `which rpc.nfsd` => net-fs/nfs-utils-1.0.6-r6 on the server

The following option is passed to the NFS daemons at start:

    # Number of servers to be started up by default
    RPCNFSDCOUNT=128

Any help, advice or pointers on where to look would be very appreciated.

Thanx,
Ramon

--
"Change what you're saying, Don't change what you said" - The Eels
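P.S. In case the full context helps, the client-side mount looks roughly like the fstab entry below. The server name and local mount point are placeholders here; only the options are the ones actually in use:

    # sketch of the client-side fstab entry; "nfsserver" and "/mnt/tools" are placeholders
    nfsserver:/mnt/raid2/tools   /mnt/tools   nfs   timeo=16,intr,lock,rsize=16384,wsize=16384,tcp   0 0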
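And the sync job mentioned above is triggered from root's crontab roughly like this (the script path/name is just a placeholder, not the real one):

    # sketch of the cron entry; the script lives on the NFS-mounted tools share
    */5 * * * * /mnt/tools/bin/sync-lb-config.sh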
