(Resent due to wrong sender, sorry)
Hello list!

*** NFS Servers: x4500-01 to x4500-05
  : Solaris 10 5/08, ZFS and "UFS on ZVOL" exported.
  : NFSD_SERVER=1024, LOCKD_SERVER=128, average use about 900 / 20 threads.
  : "bufhwm_pct,maxusers,ndquot,ncsize,ufs_ninode,clnt_max_conns,
  :  rpcmod:cotsmaxdupreqs,rpcmod:maxdupreqs" tweaked in /etc/system.

*** NFS Clients: Supermicro 1U * 40
  : Solaris 10 5/08
  : No tweaks. Mounted as:
  : x4500-03:/export/mail - /export/mail nfs - yes vers=3,hard,intr,quota
  : x4500-02:/export/preview - /export/preview nfs - yes vers=3,hard,intr

*** Background

We use vers=3 to have uid mapping without the need for UID lookups.
UFS on ZVOL filesystems are mounted with "quota". ZFS exported filesystems
are mounted without.

The system is live and generally works very well. However, NFS periodically
hangs, usually to just one of the x4500 servers at a time. The only fix we
have at the moment is to reboot the client.

I have tried fully unmounting all filesystems and terminating the NFS and
RPC processes on the client before remounting. That does not fix it. I
cannot really restart the NFSD/RPC processes on the x4500s.

It usually looks like this:

  # df -h
  [snip]
  x4500-03:/export/preview    23T   3.9M    23T     1%    /export/preview
  NFS server x4500-01 not responding still trying
  ^C

Note that during this time, x4500-01 is still functioning correctly for the
other 39 clients, and x4500-02,03,04,05 are still mounted correctly on this
NFS client.

  # umount /export/www
  # mount /export/www
  NFS server x4500-01-vip not responding still trying

Truss of the mount says:

  23102: 0.0000 getpid() = 23102 [23101]
  23102: 0.0000 door_call(5, 0x080475A0) = 0
  23102: 0.0001 close(5) = 0
  NFS server x4500-01-vip not responding still trying
  ^C23102: 69.0780 mount("x4500-01-vip:/export/www", "/export/www", MS_DATA|MS_OPTIONSTR, "nfs3", 0x0806D400, 76, 0x0804777C, 1024) Err#4 EINTR

Snoop says (x4500-01 is 172.20.12.220, NFS client is 172.20.12.16):

  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
  172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
  172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
  172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
  172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580 Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915 Seq=2255048580 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 NFS C NULL3
  172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048700 Seq=611591915 Len=0 Win=49520
  172.20.12.220 -> 172.20.12.16 NFS R NULL3
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591943 Seq=2255048700 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Fin Ack=611591943 Seq=2255048700 Len=0 Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048701 Seq=611591943 Len=0 Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Fin Ack=2255048701 Seq=611591943 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591944 Seq=2255048701 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  [delay]
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  [repeat, delay]

*** truss of mountd on x4500-01 while attempting mount:

  # truss -Dfip 28717
  28717: 6.8156 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
  28717: 0.0002 lwp_kill(788, SIG#0) Err#3 ESRCH
  28717: 0.0001 lwp_create(0x08047B90, LWP_DETACHED|LWP_SUSPENDED, 0x08047DB0) = 791
  28717/1: 0.0002 lwp_continue(791) = 0
  28717/791: 6.8159 lwp_create() (returning as new lwp ...) = 0
  28717/1: 0.0001 fxstat(2, 7, 0x08047CB0) = 0
  28717/791: 0.0003 setustack(0xFECD1A60)
  28717/1: 0.0000 getmsg(7, 0x08047D8C, 0x080CC018, 0x08047DAC) = 0
  28717/791: 0.0001 schedctl() = 0xFEFB2010
  28717/1: 0.0001 open("/dev/udp", O_RDONLY) = 16
  28717/1: 0.0001 ioctl(16, SIOCTMYADDR, 0x08047CA8) = 0
  28717/1: 0.0001 close(16) = 0
  28717/1: 0.0000 fxstat(2, 7, 0x08047C40) = 0
  28717/1: 0.0000 putmsg(7, 0x08047D18, 0x080CC018, 0) = 0
  28717/1: 0.0001 write(14, "F0", 1) = 1
  28717/791: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
  28717/791: 0.0000 read(13, "F0", 16) = 1
  28717/791: 0.0001 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
  28717/791: 0.0001 lwp_unpark(1) = 0
  28717/1: 0.0002 lwp_park(0x00000000, 0) = 0
  28717/791: 0.0000 fxstat(2, 7, 0xFEA3FE40) = 0
  28717/791: 0.0001 getmsg(7, 0xFEA3FF20, 0x080CC018, 0xFEA3FF40) = 0
  28717/791: 0.0001 open("/dev/udp", O_RDONLY) = 16
  28717/791: 0.0000 ioctl(16, SIOCTMYADDR, 0xFEA3FE38) = 0
  28717/791: 0.0001 close(16) = 0
  28717/791: 0.0000 write(14, " E", 1) = 1
  28717/1: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
  28717/791: 0.0001 getuid() = 0 [0]
  28717/1: 0.0001 read(13, " E", 16) = 1
  28717/791: 0.0000 getuid() = 0 [0]
  28717/791: 0.0001 door_info(15, 0xFEA3F360) = 0
  28717/791: 0.0001 door_call(15, 0xFEA3F3B8) = 0
  28717/791: 0.0000 resolvepath("/export/www", "/export/www", 1024) = 18
  28717/791: 0.0001 xstat(2, "/etc/dfs/sharetab", 0xFEA3F6B8) = 0
  28717/791: 0.0001 nfssys(20, 0xFEA3F860) = 0
  28717/791: 0.0000 fxstat(2, 7, 0xFEA3F6F0) = 0
  28717/791: 0.0000 putmsg(7, 0xFEA3F7C8, 0x080CC018, 0) = 0
  28717/791: 0.0001 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
  28717/791: 0.0000 lwp_exit()
  [pause]

What IS somewhat amusing, though, is that even though I cannot mount it
again using TCP, it mounts just fine if I switch to UDP. We changed most
servers to UDP and it seems to hang less often, but it still hangs
eventually.

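To pick the odd exchange on port 664 out of a longer TCP capture, here is a minimal Python sketch. It is my own helper, not part of snoop, and the function names are hypothetical. It assumes the one-line snoop summary format shown in the traces above, and it flags any SYN that is answered by a plain ACK whose Ack number is not the SYN's Seq + 1. It only reports the pattern; it cannot say why the server answers that way.

```python
import re

# One-line snoop summary for a TCP segment, for example:
# "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=... Len=0 Win=..."
TCP_LINE = re.compile(r"^(\S+) -> (\S+)\s+TCP D=(\d+) S=(\d+)\s+(.*)$")


def parse_tcp(line):
    """Return a dict for a snoop TCP summary line, or None for anything else."""
    m = TCP_LINE.match(line.strip())
    if m is None:
        return None
    src, dst, dport, sport, rest = m.groups()
    # Drop the Options=<...> tail so its contents are not mistaken for flags.
    tokens = rest.split(" Options=")[0].split()
    fields = dict(t.split("=", 1) for t in tokens if "=" in t)
    return {
        "src": src, "dst": dst, "sport": int(sport), "dport": int(dport),
        "flags": {t for t in tokens if "=" not in t},
        "seq": int(fields["Seq"]) if "Seq" in fields else None,
        "ack": int(fields["Ack"]) if "Ack" in fields else None,
    }


def stale_syn_replies(lines):
    """List (client_port, syn_seq, reply_ack) where a SYN got a mismatched ACK."""
    pending = {}  # (client, client_port, server, server_port) -> SYN seq
    hits = []
    for line in lines:
        p = parse_tcp(line)
        if p is None:
            continue
        if "Syn" in p["flags"] and p["ack"] is None and p["seq"] is not None:
            # Initial SYN (no Ack field): remember its sequence number.
            pending[(p["src"], p["sport"], p["dst"], p["dport"])] = p["seq"]
            continue
        key = (p["dst"], p["dport"], p["src"], p["sport"])  # reverse direction
        if key in pending and p["ack"] is not None:
            expected = (pending[key] + 1) % 2**32
            if p["ack"] == expected:
                del pending[key]  # normal handshake reply
            elif "Rst" not in p["flags"]:
                hits.append((key[1], pending[key], p["ack"]))
    return hits


# Four lines taken from the trace above: one good handshake, one odd reply.
SAMPLE = [
    "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>",
    "172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580 Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>",
    "172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>",
    "172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640",
]

if __name__ == "__main__":
    for port, syn_seq, ack in stale_syn_replies(SAMPLE):
        print(f"port {port}: SYN seq {syn_seq} answered with Ack={ack}")
```

Fed the four sample lines, it reports only the port 664 exchange; the port 54091 handshake is treated as normal because the server's Ack equals the client's Seq + 1.
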
  # mount -o proto=udp /export/www
  # df -h
  x4500-01-vip:/export/www   984G    73G   901G     8%    /export/www

Snoop of the successful proto=udp mount:

  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1161480443 Len=0 Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Syn Ack=1118215538 Seq=4284552306 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Ack=1118215538 Seq=4284552307 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
  172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
  172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
  172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
  172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=UDP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
  172.20.12.16 -> 172.20.12.220 NFS C NULL3
  172.20.12.221 -> 172.20.12.16 NFS R NULL3
  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=UDP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
  172.20.12.16 -> 172.20.12.220 NFS C NULL3
  172.20.12.221 -> 172.20.12.16 NFS R NULL3
  172.20.12.16 -> 172.20.12.220 NFS C FSINFO3 FH=D502
  172.20.12.221 -> 172.20.12.16 NFS R FSINFO3 OK
  172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
  172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
  172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
  172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK

Attempt to re-mount using TCP again, for fun:

  # umount /export/www
  # mount /export/www
  NFS server x4500-01-vip not responding still trying

  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
  172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
  172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
  172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
  172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
  172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
  172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Syn Seq=2389376336 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Syn Ack=2389376337 Seq=997480070 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480071 Seq=2389376337 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 NFS C NULL3
  172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376457 Seq=997480071 Len=0 Win=49520
  172.20.12.220 -> 172.20.12.16 NFS R NULL3
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480099 Seq=2389376457 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Fin Ack=997480099 Seq=2389376457 Len=0 Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376458 Seq=997480099 Len=0 Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Fin Ack=2389376458 Seq=997480099 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480100 Seq=2389376458 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1240043383 Len=0 Win=49640
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383 Seq=99287825 Len=0 Win=49640
  172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
  172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383 Seq=99287825 Len=0 Win=49640

So, TCP stays hung until reboot. If I reboot the NFS client, it mounts over
TCP just fine again. When both UDP and TCP have hung, there is nothing I can
do to make it mount. We never reboot the x4500s.

Of the 40-odd NFS clients, we have to reboot about 6 every day, which is
getting tedious. Worse than that, we do not always notice immediately that a
client is stuck.

We have put Solaris 10 10/08 on some NFS clients as well, but it is too
early to know if it fixes anything. We will most likely also try 10/08 on
the x4500s, but that is a much larger task.

Are there any NFS-related patches we should explore?

Sorry for the length of this email. I wanted to include as much detail as
possible and show that I have tried most things in an attempt to discover
where the trouble lies.

Other Google results hinted at running out of secure ports, but netstat
shows no indication of that as far as I can tell. There are no entries for
the hung NFS client on the x4500. The NFS client has a relatively small
netstat -na, with the exception of 47 entries for "stream-ord".

We would appreciate any feedback on this issue, thank you.

Lund

*** Random commands while mount is hung:

  # showmount -e x4500-01-vip
  export list for x4500-01-vip:
  /export/mail    @172.20.12, at 172.20.15
  /export/www     @172.20.12, at 172.20.15
  /export/dovecot @172.20.12, at 172.20.15

  # rpcinfo -m x4500-01-vip
  PORTMAP (version 2) statistics
  NULL    SET     UNSET   GETPORT          DUMP    CALLIT
  0       0/0     0/0     1503694/1503838  0       0/0

  PMAP_GETPORT call statistics
  prog       vers  netid  success  failure
  nlockmgr   4     udp    4342     0
  status     1     tcp    2        0
  nlockmgr   2     udp    42       0
  nlockmgr   4     tcp    1433764  0
  nfs        3     udp    346      0
  nfs        3     tcp    400      0
  status     1     udp    49       0
  mountd     1     udp    79       2
  mountd     1     tcp    11       2
  mountd     3     udp    654      113
  rquotad    1     udp    64001    23
  metad      2     tcp    3        0
  smserverd  1     tcp    0        1
  smserverd  1     udp    0        1
  300598     1     udp    1        1
  300598     1     tcp    0        1

  RPCBIND (version 3) statistics
  NULL    SET     UNSET   GETADDR  DUMP    CALLIT  TIME    U2T     T2U
  0       0/0     0/0     2/2      0       0/0     0       0       0

  RPCB_GETADDR (version 3) call statistics
  prog       vers  netid      success  failure
  status     1     ticotsord  1        0
  100133     1     ticotsord  1        0

  RPCBIND (version 4) statistics
  NULL    SET     UNSET    GETADDR  DUMP    CALLIT  TIME    U2T     T2U
  0       99/99   115/115  1/2      0       0/0     0       0       0
  VERADDR INDRECT GETLIST  GETSTAT
  0       0       1        1

  RPCB_GETADDR (version 4) call statistics
  prog       vers  netid   success  failure
  smserverd  1     ticlts  1        1

  # rpcinfo -T tcp x4500-01-vip 100005 3
  program 100005 version 3 ready and waiting

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
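
P.S. Since a stuck client is not always noticed immediately, a small probe along these lines could be run from cron on each NFS client. This is only a sketch under assumptions: the function name `mount_responds` and the 10 second timeout are made up for illustration, and it relies on the child process being killable, which the `intr` mount option should allow but which I have not verified on a hung mount.

```python
import subprocess
import sys


def mount_responds(path, timeout=10.0, cmd=None):
    """Return True if a statvfs() on `path` finishes within `timeout` seconds.

    The call is made in a child process so that a hung NFS mount blocks the
    child instead of this script. `cmd` can replace the probe command, which
    is mainly useful for testing.
    """
    if cmd is None:
        cmd = [sys.executable, "-c",
               "import os, sys; os.statvfs(sys.argv[1])", path]
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        # Exit status 0 means statvfs() returned without an error.
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: we do not wait for the child to actually exit,
        # because it may stay blocked in the kernel on a hung mount.
        proc.kill()
        return False


if __name__ == "__main__":
    # Mount points taken from the client configuration described above.
    for mount_point in ("/export/mail", "/export/preview", "/export/www"):
        state = "ok" if mount_responds(mount_point) else "NOT RESPONDING"
        print(f"{mount_point}: {state}")
```

A path that does not exist also counts as not responding here, since the child exits with a non-zero status.
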