[nfs-discuss] osol NFS server stops responding (for a few seconds)

Kees Hoekzema Wed, 24 Feb 2010 07:00:43 PST

Hello all,

First of all, I'm coming from Linux and I'm quite new to OpenSolaris, so I
might use Linux terminology once in a while.

For our new shared storage server, we recently purchased a Sun x4270, loaded it
with disks, RAM and SSDs, and installed OpenSolaris on it. This server has
replaced our old NFS server, which ran Linux. However, we are running into a
few problems now and then with the Solaris NFS server.

uname output:
SunOS athos 5.11 snv_111b i86pc i386 i86pc Solaris

First we started using NFSv4 with ~30 ZFS datasets. The Linux clients could
mount all the datasets with just one NFSv4 mount. This worked fine for a day or
so, but then all the mounts froze, and trying to access them would hang the
session. Restarting the NFS server with 'svcadm restart network/nfs/server'
made the server start shutting down, but in the ps output I could see that the
process '/usr/sbin/sharemgr stop -P nfs -a' was just hanging, not doing
anything I could see. Killing the process, even with -9, did not work. So I
ran it under truss with 'truss -fv all /usr/sbin/sharemgr stop -P nfs -a', and
it showed the files in /etc/dfs being opened and then no more output. The only
option that remained was to reboot the NFS server.

We blamed it on NFSv4, mostly because the Linux implementation isn't that
great, so we reduced our number of datasets to 8 and mounted those with
NFSv3 on our clients.

But now the NFS server just stops responding once in a while. When I watch it
with snoop, I can see the clients' requests, and normally the server responds
to them quite quickly, but once in a while (about once per hour, at no fixed
time) the server won't respond for ~1-2 seconds. When I filter the snoop output
for the server's responses to clients, I can see gaps:

last response:
173445 14:31:25.33526 athos -> ares NFS R ACCESS3 OK (read,lookup,modify,extend,delete)

1.44 s delay, 159 client packets later:
173604 14:31:26.77896 athos -> artemis NFS R COMMIT3 OK

So the server stopped responding to any client requests for 1.44 seconds. The
clients are mounted with the following options:
rw,sync,noatime,intr,timeo=3,retrans=2,soft,wsize=32768,rsize=32768,ac,acregmin=60,acregmax=600,acdirmin=60,acdirmax=600,vers=3
Understandably, because the timeout is 0.3 s (timeo=3), the clients were spamming:
nfs: server athos not responding, timed out
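
Gaps like the one above can also be found mechanically instead of by reading
the trace by hand. Below is a minimal sketch in Python, assuming the snoop
summary lines have the shape shown in the excerpt (packet number, HH:MM:SS
timestamp, source, "->", destination). The function name find_gaps and the
1.0 s default threshold are illustrative choices, not part of snoop or any
other tool.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   <packet no> <HH:MM:SS.frac> <source> -> <destination> <summary>
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d+):(\d+):(\d+(?:\.\d+)?)\s+(\S+)\s+->\s+\S+"
)

def find_gaps(lines, server="athos", threshold=1.0):
    """Return (prev_packet_no, next_packet_no, gap_seconds) for every pair of
    consecutive packets sent by `server` that are at least `threshold`
    seconds apart. Lines that do not match the assumed shape (for example
    wrapped continuation lines) are skipped. Timestamps are assumed not to
    cross midnight."""
    gaps = []
    prev = None  # (packet number, time in seconds) of the last server packet
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        num, hh, mm, ss, src = m.groups()
        if src != server:
            continue  # only packets sent by the server count as responses
        t = int(hh) * 3600 + int(mm) * 60 + float(ss)
        if prev is not None and t - prev[1] >= threshold:
            gaps.append((prev[0], int(num), round(t - prev[1], 5)))
        prev = (int(num), t)
    return gaps

# The two server responses quoted above, with the wrapped continuation line.
sample = [
    "173445 14:31:25.33526 athos -> ares NFS R ACCESS3 OK",
    "(read,lookup,modify,extend,delete)",
    "173604 14:31:26.77896 athos -> artemis NFS R COMMIT3 OK",
]
for before, after, seconds in find_gaps(sample):
    print(f"no server response between packets {before} and {after}: {seconds} s")
```

Fed a saved snoop text capture line by line, this lists every stall at or above
the threshold, which makes it easier to correlate the stalls with other
activity on the server at those times.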

I tried some DTrace scripts, both for NFSv4 and NFSv3, but those that seem
interesting, for example nfsv4io.d from
http://wikis.sun.com/display/DTrace/nfsv4+Provider, fail with an error:
dtrace: failed to compile script ./nfsv4io.d: line 12: args[ ] may not be
referenced because probe description nfsv4:::op-* matches an unstable set of
probes

So I have a few questions:
- Is there a way to find out what the 'sharemgr stop' was trying to do and why
it hung? (for the next time it decides to do nothing ;))
- What is the best way to track what the NFS server is doing, and why it stops
replying for ~1.44 seconds?
- Are there any DTrace scripts out there that do work with NFS? ;-)
--
This message posted from opensolaris.org
