Hello all,

First of all, I'm coming from Linux and I'm quite new to OpenSolaris, so I 
might use Linux terminology once in a while.

For our new shared storage server, we recently purchased a Sun x4270, loaded it 
with disks, RAM and SSDs, and installed OpenSolaris on it. This server has 
replaced our old NFS server, which ran Linux. However, we are running into a 
few problems now and then with the Solaris NFS server.

uname output:
SunOS athos 5.11 snv_111b i86pc i386 i86pc Solaris

First we started using NFSv4 with ~30 ZFS datasets. The Linux clients could 
mount all the datasets with just one NFSv4 mount. This worked fine for a day or 
so, but then all the mounts froze, and trying to access them would hang the 
session. Restarting the NFS server with 'svcadm restart network/nfs/server' 
made the NFS server try to shut down, but in the ps output I could see that the 
process '/usr/sbin/sharemgr stop -P nfs -a' was just hanging and not doing 
anything I could see. Killing the process, even with -9, did not work. So I 
started it with 'truss -fv all /usr/sbin/sharemgr stop -P nfs -a', which showed 
it opening the files in /etc/dfs and then gave no more output. The only option 
that remained was to reboot the NFS server.

We blamed it on NFSv4, mostly because the Linux implementation isn't that 
great, so we reduced our number of datasets to 8 and mounted those with NFSv3 
on our clients.

But now the NFS server just stops responding once in a while. If I look at it 
with snoop, I can see the clients' requests, and normally the server responds 
to them quite fast, but once in a while (about once per hour, at no set time) 
the server won't respond for ~1-2 seconds. When I follow the server's responses 
to clients in the snoop output, I can see gaps:

last response:
173445 14:31:25.33526        athos -> ares NFS R ACCESS3 OK (read,lookup,modify,extend,delete)

next response, after a 1.44 s delay and 159 client packets:
173604 14:31:26.77896        athos -> artemis NFS R COMMIT3 OK

So the server stopped responding to any client requests for 1.44 seconds. The 
clients are mounted with the following options:
rw,sync,noatime,intr,timeo=3,retrans=2,soft,wsize=32768,rsize=32768,ac,acregmin=60,acregmax=600,acdirmin=60,acdirmax=600,vers=3
Understandably, since the timeout is 0.3 s (timeo=3), the clients were spamming:
nfs: server athos not responding, timed out
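
To avoid eyeballing the trace, gaps like the one above can be picked out with a 
small script. This is only a minimal sketch: it assumes the snoop output was 
saved as text with absolute timestamps in the format shown above, the function 
names and the 0.3 s threshold are purely illustrative, and it does not handle 
captures that cross midnight.

```python
import re

# Matches one summary line of snoop output with absolute timestamps, e.g.
# "173445 14:31:25.33526        athos -> ares NFS R ACCESS3 OK"
LINE_RE = re.compile(
    r"^\s*(?P<num>\d+)\s+(?P<time>\d+:\d+:\d+\.\d+)\s+"
    r"(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s+(?P<rest>.*)$"
)


def parse_time(text):
    """Convert HH:MM:SS.fraction into seconds since midnight."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def find_gaps(lines, server="athos", threshold=0.3):
    """Yield (prev_packet, next_packet, gap_seconds) for every pause between
    two consecutive NFS replies from `server` that exceeds `threshold`.

    Lines that are not NFS replies from the server (client calls, wrapped
    continuation lines, other traffic) are skipped.
    """
    prev = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m["src"] != server or not m["rest"].startswith("NFS R "):
            continue
        now = (int(m["num"]), parse_time(m["time"]))
        if prev is not None and now[1] - prev[1] > threshold:
            yield prev[0], now[0], now[1] - prev[1]
        prev = now
```

Feeding it the lines of a saved capture, for example 
list(find_gaps(open("snoop.txt"))) where snoop.txt is a placeholder file name, 
gives one tuple per gap, which makes it easier to line the pauses up with 
timestamps from other tools on the server.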

I tried some DTrace scripts, both for NFSv4 and NFSv3, but the ones that seem 
interesting, for example nfsv4io.d from 
http://wikis.sun.com/display/DTrace/nfsv4+Provider, fail with an error:
dtrace: failed to compile script ./nfsv4io.d: line 12: args[ ] may not be 
referenced because probe description nfsv4:::op-* matches an unstable set of 
probes

So I have a few questions:
- Is there a way to find out what the 'sharemgr stop' was trying to do and why 
it hung? (for the next time it decides to do nothing ;))
- What is the best way to track what the NFS server is doing, and why it stops 
replying for ~1.44 seconds?
- Are there any DTrace scripts out there that do work with NFS? ;-)