Hello all,

First of all, I'm coming from Linux and I'm quite new to OpenSolaris, so I might use Linux terminology once in a while.
For our new shared storage server, we recently purchased a Sun x4270, loaded it with disks, RAM and SSDs, and installed OpenSolaris on it. This server has replaced our old NFS server, which ran Linux. However, we are running into a few problems now and then with the Solaris NFS server.

uname output:

    SunOS athos 5.11 snv_111b i86pc i386 i86pc Solaris

First we started using NFSv4 with ~30 ZFS datasets. The Linux clients could mount all the datasets with just one NFSv4 mount. This worked fine for a day or so, but then all the mounts froze, and trying to access those mounts would result in a locked session.

Trying to restart the NFS server with 'svcadm restart network/nfs/server' would result in the NFS server trying to shut down. But in the ps output, I could see that the process '/usr/sbin/sharemgr stop -P nfs -a' was just hanging and not doing anything I could see. Killing the process, even with -9, would not work. So I started it with 'truss -fv all /usr/sbin/sharemgr stop -P nfs -a', and it would just show it opening the files in /etc/dfs and then no more output. The only option that remained was to reboot the NFS server.

We blamed it on NFSv4, because the Linux implementation mostly isn't that great, so we reduced our number of datasets to 8 and mounted those with NFSv3 on our clients.

But now the NFS server just stops responding once in a while. If I look at it with snoop, I can see the clients' requests, and normally the server responds to them quite fast, but once in a while (once per hour or so, at no set time) the server won't respond for ~1-2 seconds. When I look at the snoop output to track the server's responses to clients, I can see gaps:

last response:

    173445 14:31:25.33526 athos -> ares NFS R ACCESS3 OK (read,lookup,modify,extend,delete)

1.44s delay, 159 client packets later:

    173604 14:31:26.77896 athos -> artemis NFS R COMMIT3 OK

So the server stopped responding to any client requests for 1.44 seconds.
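To look for further gaps like this without reading the capture by eye, something along the following lines could be used. This is only a rough sketch in Python: it assumes the snoop text output has the same layout as the two lines quoted above (frame number, absolute timestamp, source -> destination), and the function name `find_gaps` and the 1.0 second threshold are made up for illustration. It also ignores captures that run past midnight.

```python
import re

# Assumed line layout, taken from the two snoop lines quoted above:
#   <frame> <HH:MM:SS.fraction> <source> -> <destination> <rest>
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d{1,2}:\d{2}:\d{2}\.\d+)\s+(\S+)\s+->\s+(\S+)\s+(.*)$"
)


def parse_time(stamp):
    """Convert HH:MM:SS.fraction into seconds since midnight."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def find_gaps(lines, server="athos", threshold=1.0):
    """Return (previous_frame, next_frame, gap_seconds) for every pause
    of at least `threshold` seconds between two consecutive packets
    sent by `server`. Lines that do not match the layout are skipped."""
    gaps = []
    previous = None  # (frame, time) of the last packet sent by the server
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(3) != server:
            continue
        frame = int(match.group(1))
        when = parse_time(match.group(2))
        if previous is not None:
            delta = when - previous[1]
            if delta >= threshold:
                gaps.append((previous[0], frame, round(delta, 5)))
        previous = (frame, when)
    return gaps
```

With the snoop output saved to a text file (the name is up to you), calling `find_gaps(open("capture.txt"))` would list every pause of a second or more in the server's replies, which makes it easier to compare the times of the stalls with other activity on the server.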
The clients are mounted with the following options:

    rw,sync,noatime,intr,timeo=3,retrans=2,soft,wsize=32768,rsize=32768,ac,acregmin=60,acregmax=600,acdirmin=60,acdirmax=600,vers=3

So, understandably, because the timeout is 0.3s (timeo=3, in tenths of a second), the clients were spamming:

    nfs: server athos not responding, timed out

I tried some DTrace scripts, both for NFSv4 and NFSv3, but those that seem interesting, for example the nfsv4io.d from http://wikis.sun.com/display/DTrace/nfsv4+Provider, print an error:

    dtrace: failed to compile script ./nfsv4io.d: line 12: args[ ] may not be referenced because probe description nfsv4:::op-* matches an unstable set of probes

So I have a few questions:

- Is there a way to find out what the 'sharemgr stop' was trying to do and why it hung? (for the next time it decides to do nothing ;))
- What is the best way to track what the NFS server is doing, and why it stops replying for ~1.44 seconds?
- Are there any DTrace scripts out there that do work with NFS? ;-)

--
This message posted from opensolaris.org