The problem seems to be on the TCP connection between the client and the nfsd on the server. The portmap and mount requests used UDP and they went OK.
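One rough way to check this against the snoop text quoted below is to group the TCP summary lines by conversation and look for SYNs that are never answered with a SYN+ACK. The following is only a minimal Python sketch under my own assumptions: the names `summarize_tcp`, `stalled_handshakes` and `SAMPLE` are made up for illustration, and the regular expression only understands the one-line summary format that appears in this thread, not raw capture files.

```python
import re
from collections import defaultdict

# Matches snoop summary lines such as:
#   172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
# A leading "> " quote marker is tolerated because search() is used.
TCP_LINE = re.compile(
    r"(?P<src>\S+)\s+->\s+(?P<dst>\S+)\s+TCP\s+"
    r"D=(?P<dport>\d+)\s+S=(?P<sport>\d+)(?P<rest>.*)"
)


def summarize_tcp(lines):
    """Count SYN, SYN+ACK, bare-ACK and RST segments per TCP conversation."""
    stats = defaultdict(lambda: {"syn": 0, "synack": 0, "ack_only": 0, "rst": 0})
    for line in lines:
        m = TCP_LINE.search(line)
        if m is None:
            continue  # not a TCP summary line (PORTMAP, MOUNT3, NFS, wrapped text)
        src = (m.group("src"), int(m.group("sport")))
        dst = (m.group("dst"), int(m.group("dport")))
        # Sort the endpoints so both directions share one key.
        key = tuple(sorted([src, dst]))
        tokens = m.group("rest").split()
        has_ack = any(t.startswith("Ack=") for t in tokens)
        if "Rst" in tokens:
            stats[key]["rst"] += 1
        elif "Syn" in tokens and has_ack:
            stats[key]["synack"] += 1
        elif "Syn" in tokens:
            stats[key]["syn"] += 1
        elif has_ack and "Fin" not in tokens:
            stats[key]["ack_only"] += 1
    return stats


def stalled_handshakes(stats):
    """Conversations where a SYN was seen but no SYN+ACK ever came back."""
    return [k for k, s in stats.items() if s["syn"] > 0 and s["synack"] == 0]


# A few lines copied from the quoted trace (wrapped continuations dropped).
SAMPLE = [
    "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579",
    "172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580",
    "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915",
    "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0",
    "172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538",
    "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0",
    "172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538",
]

if __name__ == "__main__":
    for conversation in stalled_handshakes(summarize_tcp(SAMPLE)):
        print(conversation)
```

On these sample lines the connection from client port 54091 completes its handshake, while the one from port 664 only ever gets a bare ACK back. That pattern would be consistent with the server still holding state for an older connection on that port pair, but the sketch cannot confirm it; the raw captures from both sides are still needed.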
There are a number of TCP RST packets sent from both the client and the server, which suggests that lost packets may have left the two sides out of sync. It looks like the server has two NICs on the same subnet, 172.20.12.221 and 172.20.12.220. Have you tried disabling 172.20.12.220 and using only 172.20.12.221 to see if it helps? What does the output of 'netstat -in' and 'netstat -rn' look like on the server and on the client?

By the way, where were the packets captured, on the server or on the client? It would be more useful if you could capture the packets on both sides and attach the raw capture files so they can be compared and examined in more detail.

-Dai

Jorgen Lundman wrote:
> (Resent due to wrong sender, sorry)
>
>
> Hello list!
>
> *** NFS Servers:
>
> x4500-01 to x4500-05
> : Solaris 10 5/08, ZFS and "UFS on ZVOL" exported.
> : NFSD_SERVER=1024, LOCKD_SERVER=128 average use about 900 / 20 threads.
> : "bufhwm_pct,maxusers,ndquot,ncsize,ufs_ninode,clnt_max_conns,
> : rpcmod:cotsmaxdupreqs,rpcmod:maxdupreqs" tweaked in /etc/system.
>
> *** NFS Clients:
>
> Supermicro 1U * 40
> : Solaris 10 5/08
> : No tweaks, Mounted as
> : x4500-03:/export/mail - /export/mail nfs - yes vers=3,hard,intr,quota
> : x4500-02:/export/preview - /export/preview nfs - yes vers=3,hard,intr
>
>
> *** Background
>
> Using vers=3 to have uid mapping, without the need for UID lookups. UFS
> on ZVOL are mounted with "quota". ZFS exported filesystems are mounted
> without. The system is live and generally works very well.
>
> However, NFS will periodically hang. Usually to just one of the x4500
> servers at a time, the solution currently is just to reboot the client.
> I have attempted to fully umount all filesystems, and terminate the NFS
> and RPC processes, in an attempt to remount. This will not fix it. I can
> not really restart the NFSD/RPC processes on the x4500s.
>
> Usually looks like:
>
> # df -h
> [snip]
> x4500-03:/export/preview
> 23T 3.9M 23T 1% /export/preview
> NFS server x4500-01 not responding still trying
> ^C
>
> Note that during this time, x4500-01 is still functioning correctly to
> the other 39 servers, and x4500-02,03,04,05 are still mounted correctly
> on this NFS client.
>
> # umount /export/www
> # mount /export/www
> NFS server x4500-01-vip not responding still trying
>
> Truss of the mount says:
> 23102: 0.0000 getpid() = 23102
> [23101]
> 23102: 0.0000 door_call(5, 0x080475A0) = 0
> 23102: 0.0001 close(5) = 0
> NFS server x4500-01-vip not responding still trying
> ^C23102: 69.0780 mount("x4500-01-vip:/export/www", "/export/www",
> MS_DATA|MS_OPTIONSTR, "nfs3", 0x0806D400, 76, 0x0804777C, 1024) Err#4 EINTR
>
> Snoop says (x4500-01 is 172.20.12.220, NFS Client is 172.20.12.16)
>
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
> vers=3 proto=UDP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> proto=TCP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579
> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580
> Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915
> Seq=2255048580 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048700
> Seq=611591915 Len=0 Win=49520
> 172.20.12.220 -> 172.20.12.16 NFS R NULL3
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591943
> Seq=2255048700 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Fin Ack=611591943
> Seq=2255048700 Len=0 Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048701
> Seq=611591943 Len=0 Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Fin Ack=2255048701
> Seq=611591943 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591944
> Seq=2255048701 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> [delay]
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> [repeat, delay]
>
>
> *** truss of mountd on x4500-01 while attempting mount:
>
> # truss -Dfip 28717
> 28717: 6.8156 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
> 28717: 0.0002 lwp_kill(788, SIG#0) Err#3 ESRCH
> 28717: 0.0001 lwp_create(0x08047B90, LWP_DETACHED|LWP_SUSPENDED,
> 0x08047DB0) = 791
> 28717/1: 0.0002 lwp_continue(791) = 0
> 28717/791: 6.8159 lwp_create() (returning as new lwp ...) = 0
> 28717/1: 0.0001 fxstat(2, 7, 0x08047CB0) = 0
> 28717/791: 0.0003 setustack(0xFECD1A60)
> 28717/1: 0.0000 getmsg(7, 0x08047D8C, 0x080CC018, 0x08047DAC) = 0
> 28717/791: 0.0001 schedctl()
> = 0xFEFB2010
> 28717/1: 0.0001 open("/dev/udp", O_RDONLY) = 16
> 28717/1: 0.0001 ioctl(16, SIOCTMYADDR, 0x08047CA8) = 0
> 28717/1: 0.0001 close(16) = 0
> 28717/1: 0.0000 fxstat(2, 7, 0x08047C40) = 0
> 28717/1: 0.0000 putmsg(7, 0x08047D18, 0x080CC018, 0) = 0
> 28717/1: 0.0001 write(14, "F0", 1) = 1
> 28717/791: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
> 28717/791: 0.0000 read(13, "F0", 16) = 1
> 28717/791: 0.0001 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
> 28717/791: 0.0001 lwp_unpark(1) = 0
> 28717/1: 0.0002 lwp_park(0x00000000, 0) = 0
> 28717/791: 0.0000 fxstat(2, 7, 0xFEA3FE40) = 0
> 28717/791: 0.0001 getmsg(7, 0xFEA3FF20, 0x080CC018, 0xFEA3FF40) = 0
> 28717/791: 0.0001 open("/dev/udp", O_RDONLY) = 16
> 28717/791: 0.0000 ioctl(16, SIOCTMYADDR, 0xFEA3FE38) = 0
> 28717/791: 0.0001 close(16) = 0
> 28717/791: 0.0000 write(14, " E", 1) = 1
> 28717/1: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
> 28717/791: 0.0001 getuid()
> = 0 [0]
> 28717/1: 0.0001 read(13, " E", 16) = 1
> 28717/791: 0.0000 getuid()
> = 0 [0]
> 28717/791: 0.0001 door_info(15, 0xFEA3F360) = 0
> 28717/791: 0.0001 door_call(15, 0xFEA3F3B8) = 0
> 28717/791: 0.0000 resolvepath("/export/www", "/export/www", 1024) = 18
> 28717/791: 0.0001 xstat(2, "/etc/dfs/sharetab", 0xFEA3F6B8) = 0
> 28717/791: 0.0001 nfssys(20, 0xFEA3F860) = 0
> 28717/791: 0.0000 fxstat(2, 7, 0xFEA3F6F0) = 0
> 28717/791: 0.0000 putmsg(7, 0xFEA3F7C8, 0x080CC018, 0) = 0
> 28717/791: 0.0001 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7)
> = 0xFFBFFEFF [0x0000FFFF]
> 28717/791: 0.0000 lwp_exit()
>
> [pause]
>
>
>
> What IS somewhat amusing though, even though I can not mount it again
> using TCP but if I change to using UDP it will mount just fine. We
> changed most servers to using UDP and it seems to hang less, but it will
> still eventually hang.
>
> # mount -o proto=udp /export/www
> # df -h
> x4500-01-vip:/export/www
> 984G 73G 901G 8% /export/www
>
>
> Successful mount proto=udp snoop:
>
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1161480443
> Len=0 Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Syn Ack=1118215538
> Seq=4284552306 Len=0 Win=49640 Options=<mss 1460,nop,wscale
> 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Ack=1118215538
> Seq=4284552307 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
> vers=3 proto=UDP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> proto=UDP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> 172.20.12.221 -> 172.20.12.16 NFS R NULL3
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> proto=UDP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> 172.20.12.221 -> 172.20.12.16 NFS R NULL3
> 172.20.12.16 -> 172.20.12.220 NFS C FSINFO3 FH=D502
> 172.20.12.221 -> 172.20.12.16 NFS R FSINFO3 OK
> 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
> 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
> 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
> 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
>
>
> Attempt to re-mount using TCP again, for fun
>
> # umount /export/www
> # mount /export/www
> NFS server x4500-01-vip not responding still trying
>
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
> vers=3 proto=UDP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
> proto=TCP
> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Syn Seq=2389376336
> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Syn Ack=2389376337
> Seq=997480070 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480071
> Seq=2389376337 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376457
> Seq=997480071 Len=0 Win=49520
> 172.20.12.220 -> 172.20.12.16 NFS R NULL3
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480099
> Seq=2389376457 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Fin Ack=997480099
> Seq=2389376457 Len=0 Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376458
> Seq=997480099 Len=0 Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Fin Ack=2389376458
> Seq=997480099 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480100
> Seq=2389376458 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1240043383
> Len=0 Win=49640
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
> Seq=99287825 Len=0 Win=49640
> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
> Seq=99287825 Len=0 Win=49640
>
>
> So, TCP is hung until reboot. If I reboot the NFS client it will mount
> TCP just fine again. When both UDP and TCP have hung there is nothing I
> can do to make it mount. We never reboot the x4500's.
>
> So, since of the 40 odd NFS clients, we have to reboot about 6 every day
> which is getting tedious, and worse than that, we do not always notice
> it is stuck immediately.
>
> We have put Solaris 10 10/08 on some NFS clients as well, but it is too
> early to know if it fixes anything. We will most likely also try 10/08
> on the x4500, but that is a much larger task.
>
> Are there any NFS related patches we should explore?
>
> Sorry for the length of this email, I wanted to include as much details
> as possible and show I have tried most things in an attempt to discover
> where the trouble lies.
>
> Other Google results hinted on running out of secure ports, but netstat
> shows no indication of that as far as I can tell. No entries for the
> hung NFS client on the x4500. The NFS client has a relatively small
> netstat -na, with the exception of 47 entries for "stream-ord".
>
>
> We would appreciate any feedback on this issue, thank you.
>
>
> Lund
>
>
> *** Random commands while mount is hung:
>
> # showmount -e x4500-01-vip
> export list for x4500-01-vip:
> /export/mail @172.20.12,@172.20.15
> /export/www @172.20.12,@172.20.15
> /export/dovecot @172.20.12,@172.20.15
>
>
> # rpcinfo -m x4500-01-vip
> PORTMAP (version 2) statistics
> NULL SET UNSET GETPORT DUMP CALLIT
> 0 0/0 0/0 1503694/1503838 0 0/0
>
> PMAP_GETPORT call statistics
> prog vers netid success failure
> nlockmgr 4 udp 4342 0
> status 1 tcp 2 0
> nlockmgr 2 udp 42 0
> nlockmgr 4 tcp 1433764 0
> nfs 3 udp 346 0
> nfs 3 tcp 400 0
> status 1 udp 49 0
> mountd 1 udp 79 2
> mountd 1 tcp 11 2
> mountd 3 udp 654 113
> rquotad 1 udp 64001 23
> metad 2 tcp 3 0
> smserverd 1 tcp 0 1
> smserverd 1 udp 0 1
> 300598 1 udp 1 1
> 300598 1 tcp 0 1
>
> RPCBIND (version 3) statistics
> NULL SET UNSET GETADDR DUMP CALLIT TIME U2T T2U
> 0 0/0 0/0 2/2 0 0/0 0 0 0
>
> RPCB_GETADDR (version 3) call statistics
> prog vers netid success failure
> status 1 ticotsord 1 0
> 100133 1 ticotsord 1 0
>
> RPCBIND (version 4) statistics
> NULL SET UNSET GETADDR DUMP CALLIT TIME U2T T2U
> 0 99/99 115/115 1/2 0 0/0 0 0 0
> VERADDR INDRECT GETLIST GETSTAT
> 0 0 1 1
>
> RPCB_GETADDR (version 4) call statistics
> prog vers netid success failure
> smserverd 1 ticlts 1 1
>
>
> # rpcinfo -T tcp x4500-01-vip 100005 3
> program 100005 version 3 ready and waiting
>