Hi,

I've been having problems with the NFS mount on one of my load balancers
and I can't seem to find a good solution,
so I thought I might ask here for any (debugging) tips.

Currently I mount our tools & scripts repository over NFS on all our servers.
I'm using the following mount options for that:
timeo=16,intr,lock,rsize=16384,wsize=16384,tcp
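
For reference, the full fstab line on the clients looks roughly like this
(the server hostname and local mount point are placeholders, not the real ones):

# /etc/fstab on a client -- hostname and mount point are just examples
nfsserver:/mnt/raid2/tools  /mnt/tools  nfs  timeo=16,intr,lock,rsize=16384,wsize=16384,tcp  0 0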

Originally these client-side mount options (apart from the tcp one) were not
present, which caused a lot of problems due to hanging NFS shares and
uninterruptible processes. Hence the intr/timeo values.
By a hanging NFS mount I mean an unresponsive mounted filesystem which
hangs any process that tries to access it.
This includes tab-completion in bash etc.

Changing these NFS options has alleviated the problems across the cluster
immensely: where we used to have at least 2-5 servers hanging in limbo
every day, we're now down to maybe one per week.

With one exception though :-(

Our load balancer has trouble with this NFS mount roughly every two or
three days.
The mount simply hangs. Because of the intr option I can actually kill
all processes using the filesystem and umount/mount the share. After
this procedure it starts to function normally again.
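
For the record, the recovery boils down to something like this (the mount
point is just an example, not the real path):

# kill everything that has the hung mount open
fuser -vmk /mnt/tools
# lazy unmount in case a plain umount still blocks on the dead mount
umount -l /mnt/tools
# remount it (picks the options back up from fstab)
mount /mnt/tools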

I'm very certain it isn't an NFS-server problem, as I have the same
filesystem mounted on approx. 80 servers and on those machines the
filesystem is accessible. Testing access to the NFS server confirms this:
I can mount the same share on a different mount point without any problems.
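
For example, something like this works fine even while the original mount
is hung (the hostname and test mount point are made up):

mkdir -p /mnt/tools-test
mount -t nfs -o timeo=16,intr,lock,rsize=16384,wsize=16384,tcp nfsserver:/mnt/raid2/tools /mnt/tools-test
ls /mnt/tools-test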

The relevant entry in /etc/exports on the nfs-server is:
/mnt/raid2/tools/ 10.10.0.0/24(rw,async,no_root_squash)

I can't find anything relevant in the logfiles when this hang occurs;
tips on specific things to look for would be appreciated.
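
These are the kind of checks I plan to run on the client the next time it
hangs, unless someone has better ideas (the server name is a placeholder):

# kernel ring buffer for nfs/rpc timeout messages
dmesg | grep -i -e nfs -e rpc
# client-side RPC statistics, mainly looking for retransmissions
nfsstat -rc
# check that the portmapper and nfs services still answer from this host
rpcinfo -p nfsserver
# the mount options as the kernel actually sees them
grep nfs /proc/mounts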

One of the reasons this problem is highly annoying is that I run a
script from our tools share to sync the load-balancer config between my
primary and secondary load balancer. The moment the share hangs, this
script obviously doesn't work either.
Ergo I run the risk of having my load balancers out of sync, which I
would prefer to avoid.

Maybe this script is causing the problem?
It's run every 5 minutes from cron.
Or maybe the combination of async with frequent access is causing the
problem?
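
For completeness, the cron entry boils down to something like this (the
script path is made up):

# crontab on the primary load balancer
*/5 * * * * /mnt/tools/bin/sync-lb-config.sh

One thing I can imagine: when the share hangs, these jobs never exit, so
cron keeps piling new ones on top every 5 minutes.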

Googling turns up lots of info on NFS. I've read through a bunch of
howtos and performance advice.
Most of it is pretty outdated and refers to 2.2 Linux kernels :-(
Some of it had good advice, but none of it has solved the problem so far.

All servers run Gentoo Linux, 2.6.x kernels and a recent install of the
mount binaries:
* equery belongs $(which mount) => sys-apps/util-linux-2.12r on the client
* equery belongs $(which rpc.nfsd) => net-fs/nfs-utils-1.0.6-r6 on the server

The following options are passed to the NFS daemons at start:
# Number of servers to be started up by default
RPCNFSDCOUNT=128
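
Something I still want to verify on the server is whether those 128 threads
are ever all busy; as far as I understand, the 'th' line in
/proc/net/rpc/nfsd shows that:

# second field is the thread count, third is how many times all of them were in use
grep th /proc/net/rpc/nfsd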

Any help, advice or pointers on where to look would be much appreciated.
Thanx,

Ramon

- 
Change what you're saying,
Don't change what you said

The Eels
