Thank you for your reply.

The x4500s use a real IP and an IP alias. The NFS mounts connect to the alias, so that it would be "easier" to fail over to a different x4500 should the need arise. We have not got that far yet, as we are first exploring ways to replicate the data from the active x4500 to a passive x4500.
But I will test the idea that the alias might be involved, see if it mounts the real IP etc. I will report back.

Dai Ngo wrote:
> The problem seems to be on the TCP connection between the client and the
> nfsd on the server. The portmap and mount requests used UDP and they went OK.
>
> There are a number TCP RST packets sent from both the client and server,
> this indicated there might be problem with packets lost causing both sides
> to be out of sync.
>
> Looks like the server has 2 NICs on the same subnet, 172.20.12.221 and
> 172.20.12.220.
> Have you tried disable 172.20.12.220 and just use 172.20.12.221 to see
> if it helps.
> What the output of the 'netstat -in' and 'netstat -rn' on the server and
> the client look like?
>
> By the way, where were the packets captured from? on the server or the
> client. It's more useful if you can capture the packets on both sides and
> attach the raw capture files so they can be compared and examined in more
> details.
>
> -Dai
>
> Jorgen Lundman wrote:
>> (Resent due to wrong sender, sorry)
>>
>>
>> Hello list!
>>
>> *** NFS Servers:
>>
>> x4500-01 to x4500-05
>> : Solaris 10 5/08, ZFS and "UFS on ZVOL" exported.
>> : NFSD_SERVER=1024, LOCKD_SERVER=128 average use about 900 / 20 threads.
>> : "bufhwm_pct,maxusers,ndquot,ncsize,ufs_ninode,clnt_max_conns,
>> : rpcmod:cotsmaxdupreqs,rpcmod:maxdupreqs" tweaked in /etc/system.
>>
>> *** NFS Clients:
>>
>> Supermicro 1U * 40
>> : Solaris 10 5/08
>> : No tweaks, Mounted as
>> : x4500-03:/export/mail - /export/mail nfs - yes vers=3,hard,intr,quota
>> : x4500-02:/export/preview - /export/preview nfs - yes vers=3,hard,intr
>>
>>
>> *** Background
>>
>> Using vers=3 to have uid mapping, without the need for UID lookups. UFS
>> on ZVOL are mounted with "quota". ZFS exported filesystems are mounted
>> without. The system is live and generally works very well.
>>
>> However, NFS will periodically hang.
>> Usually to just one of the x4500 servers at a time, the solution
>> currently is just to reboot the client. I have attempted to fully umount
>> all filesystems, and terminate the NFS and RPC processes, in an attempt
>> to remount. This will not fix it. I can not really restart the NFSD/RPC
>> processes on the x4500s.
>>
>> Usually looks like:
>>
>> # df -h
>> [snip]
>> x4500-03:/export/preview
>>                         23T   3.9M    23T     1%    /export/preview
>> NFS server x4500-01 not responding still trying
>> ^C
>>
>> Note that during this time, x4500-01 is still functioning correctly to
>> the other 39 servers, and x4500-02,03,04,05 are still mounted correctly
>> on this NFS client.
>>
>> # umount /export/www
>> # mount /export/www
>> NFS server x4500-01-vip not responding still trying
>>
>> Truss of the mount says:
>> 23102: 0.0000 getpid() = 23102 [23101]
>> 23102: 0.0000 door_call(5, 0x080475A0) = 0
>> 23102: 0.0001 close(5) = 0
>> NFS server x4500-01-vip not responding still trying
>> ^C23102: 69.0780 mount("x4500-01-vip:/export/www", "/export/www", MS_DATA|MS_OPTIONSTR, "nfs3", 0x0806D400, 76, 0x0804777C, 1024) Err#4 EINTR
>>
>> Snoop says (x4500-01 is 172.20.12.220, NFS Client is 172.20.12.16)
>>
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580 Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915 Seq=2255048580 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048700 Seq=611591915 Len=0 Win=49520
>> 172.20.12.220 -> 172.20.12.16 NFS R NULL3
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591943 Seq=2255048700 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Fin Ack=611591943 Seq=2255048700 Len=0 Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048701 Seq=611591943 Len=0 Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Fin Ack=2255048701 Seq=611591943 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591944 Seq=2255048701 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> [delay]
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> [repeat, delay]
>>
>>
>> *** truss of mountd on x4500-01 while attempting mount:
>>
>> # truss -Dfip 28717
>> 28717: 6.8156 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
>> 28717: 0.0002 lwp_kill(788, SIG#0) Err#3 ESRCH
>> 28717: 0.0001 lwp_create(0x08047B90, LWP_DETACHED|LWP_SUSPENDED, 0x08047DB0) = 791
>> 28717/1: 0.0002 lwp_continue(791) = 0
>> 28717/791: 6.8159 lwp_create() (returning as new lwp ...) = 0
>> 28717/1: 0.0001 fxstat(2, 7, 0x08047CB0) = 0
>> 28717/791: 0.0003 setustack(0xFECD1A60)
>> 28717/1: 0.0000 getmsg(7, 0x08047D8C, 0x080CC018, 0x08047DAC) = 0
>> 28717/791: 0.0001 schedctl() = 0xFEFB2010
>> 28717/1: 0.0001 open("/dev/udp", O_RDONLY) = 16
>> 28717/1: 0.0001 ioctl(16, SIOCTMYADDR, 0x08047CA8) = 0
>> 28717/1: 0.0001 close(16) = 0
>> 28717/1: 0.0000 fxstat(2, 7, 0x08047C40) = 0
>> 28717/1: 0.0000 putmsg(7, 0x08047D18, 0x080CC018, 0) = 0
>> 28717/1: 0.0001 write(14, "F0", 1) = 1
>> 28717/791: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
>> 28717/791: 0.0000 read(13, "F0", 16) = 1
>> 28717/791: 0.0001 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
>> 28717/791: 0.0001 lwp_unpark(1) = 0
>> 28717/1: 0.0002 lwp_park(0x00000000, 0) = 0
>> 28717/791: 0.0000 fxstat(2, 7, 0xFEA3FE40) = 0
>> 28717/791: 0.0001 getmsg(7, 0xFEA3FF20, 0x080CC018, 0xFEA3FF40) = 0
>> 28717/791: 0.0001 open("/dev/udp", O_RDONLY) = 16
>> 28717/791: 0.0000 ioctl(16, SIOCTMYADDR, 0xFEA3FE38) = 0
>> 28717/791: 0.0001 close(16) = 0
>> 28717/791: 0.0000 write(14, " E", 1) = 1
>> 28717/1: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
>> 28717/791: 0.0001 getuid() = 0 [0]
>> 28717/1: 0.0001 read(13, " E", 16) = 1
>> 28717/791: 0.0000 getuid() = 0 [0]
>> 28717/791: 0.0001 door_info(15, 0xFEA3F360) = 0
>> 28717/791: 0.0001 door_call(15, 0xFEA3F3B8) = 0
>> 28717/791: 0.0000 resolvepath("/export/www", "/export/www", 1024) = 18
>> 28717/791: 0.0001 xstat(2, "/etc/dfs/sharetab", 0xFEA3F6B8) = 0
>> 28717/791: 0.0001 nfssys(20, 0xFEA3F860) = 0
>> 28717/791: 0.0000 fxstat(2, 7, 0xFEA3F6F0) = 0
>> 28717/791: 0.0000 putmsg(7, 0xFEA3F7C8, 0x080CC018, 0) = 0
>> 28717/791: 0.0001 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
>> 28717/791: 0.0000 lwp_exit()
>>
>> [pause]
>>
>>
>>
>> What IS somewhat amusing though, even though I can not mount it again
>> using TCP but if I change to using UDP it will mount just fine. We
>> changed most servers to using UDP and it seems to hang less, but it will
>> still eventually hang.
>>
>> # mount -o proto=udp /export/www
>> # df -h
>> x4500-01-vip:/export/www
>>                        984G    73G   901G     8%    /export/www
>>
>>
>> Successful mount proto=udp snoop:
>>
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1161480443 Len=0 Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Syn Ack=1118215538 Seq=4284552306 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Ack=1118215538 Seq=4284552307 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=UDP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>> 172.20.12.221 -> 172.20.12.16 NFS R NULL3
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=UDP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>> 172.20.12.221 -> 172.20.12.16 NFS R NULL3
>> 172.20.12.16 -> 172.20.12.220 NFS C FSINFO3 FH=D502
>> 172.20.12.221 -> 172.20.12.16 NFS R FSINFO3 OK
>> 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
>> 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
>> 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
>> 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
>>
>>
>> Attempt to re-mount using TCP again, for fun
>>
>> # umount /export/www
>> # mount /export/www
>> NFS server x4500-01-vip not responding still trying
>>
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Syn Seq=2389376336 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Syn Ack=2389376337 Seq=997480070 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480071 Seq=2389376337 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376457 Seq=997480071 Len=0 Win=49520
>> 172.20.12.220 -> 172.20.12.16 NFS R NULL3
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480099 Seq=2389376457 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Fin Ack=997480099 Seq=2389376457 Len=0 Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376458 Seq=997480099 Len=0 Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Fin Ack=2389376458 Seq=997480099 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480100 Seq=2389376458 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1240043383 Len=0 Win=49640
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383 Seq=99287825 Len=0 Win=49640
>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383 Seq=99287825 Len=0 Win=49640
>>
>>
>> So, TCP is hung until reboot. If I reboot the NFS client it will mount
>> TCP just fine again. When both UDP and TCP have hung there is nothing I
>> can do to make it mount. We never reboot the x4500's.
>>
>> So, since of the 40 odd NFS clients, we have to reboot about 6 every day
>> which is getting tedious, and worse than that, we do not always notice
>> it is stuck immediately.
>>
>> We have put Solaris 10 10/08 on some NFS clients as well, but it is too
>> early to know if it fixes anything. We will most likely also try 10/08
>> on the x4500, but that is a much larger task.
>>
>> Are there any NFS related patches we should explore?
>>
>> Sorry for the length of this email, I wanted to include as much details
>> as possible and show I have tried most things in an attempt to discover
>> where the trouble lies.
>>
>> Other Google results hinted on running out of secure ports, but netstat
>> shows no indication of that as far as I can tell.
>> No entries for the hung NFS client on the x4500. The NFS client has a
>> relatively small netstat -na, with the exception of 47 entries for
>> "stream-ord".
>>
>>
>> We would appreciate any feedback on this issue, thank you.
>>
>>
>> Lund
>>
>>
>> *** Random commands while mount is hung:
>>
>> # showmount -e x4500-01-vip
>> export list for x4500-01-vip:
>> /export/mail    @172.20.12, at 172.20.15
>> /export/www     @172.20.12, at 172.20.15
>> /export/dovecot @172.20.12, at 172.20.15
>>
>>
>> # rpcinfo -m x4500-01-vip
>> PORTMAP (version 2) statistics
>> NULL    SET     UNSET   GETPORT            DUMP    CALLIT
>> 0       0/0     0/0     1503694/1503838    0       0/0
>>
>> PMAP_GETPORT call statistics
>> prog        vers  netid  success  failure
>> nlockmgr    4     udp    4342     0
>> status      1     tcp    2        0
>> nlockmgr    2     udp    42       0
>> nlockmgr    4     tcp    1433764  0
>> nfs         3     udp    346      0
>> nfs         3     tcp    400      0
>> status      1     udp    49       0
>> mountd      1     udp    79       2
>> mountd      1     tcp    11       2
>> mountd      3     udp    654      113
>> rquotad     1     udp    64001    23
>> metad       2     tcp    3        0
>> smserverd   1     tcp    0        1
>> smserverd   1     udp    0        1
>> 300598      1     udp    1        1
>> 300598      1     tcp    0        1
>>
>> RPCBIND (version 3) statistics
>> NULL    SET     UNSET   GETADDR  DUMP    CALLIT  TIME    U2T     T2U
>> 0       0/0     0/0     2/2      0       0/0     0       0       0
>>
>> RPCB_GETADDR (version 3) call statistics
>> prog        vers  netid      success  failure
>> status      1     ticotsord  1        0
>> 100133      1     ticotsord  1        0
>>
>> RPCBIND (version 4) statistics
>> NULL    SET     UNSET    GETADDR  DUMP    CALLIT  TIME    U2T     T2U
>> 0       99/99   115/115  1/2      0       0/0     0       0       0
>> VERADDR INDRECT GETLIST GETSTAT
>> 0       0       1       1
>>
>> RPCB_GETADDR (version 4) call statistics
>> prog        vers  netid   success  failure
>> smserverd   1     ticlts  1        1
>>
>>
>> # rpcinfo -T tcp x4500-01-vip 100005 3
>> program 100005 version 3 ready and waiting
>>
>>
>>
>>
>
>

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
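
PS: In the quoted snoop output the client sends its requests to 172.20.12.220, while the PORTMAP and MOUNT3 replies (and the NFS replies in the UDP case) come back from 172.20.12.221. In case it helps when comparing captures from both sides, here is a minimal sketch for flagging that pattern. It assumes snoop summary lines in the "src -> dst summary" layout shown in the trace; the function name find_asymmetric_replies and the "never contacted" heuristic are only illustrative and are not part of snoop or any Solaris tool.

```python
import re

# Matches the "src -> dst summary" layout of the quoted snoop lines.
# Assumption: IPv4 addresses only, one packet per line.
LINE_RE = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\s+->\s+(\d+\.\d+\.\d+\.\d+)\s+(.*)$"
)


def find_asymmetric_replies(lines, client):
    """Return (source, summary) pairs for packets sent to `client`
    from an address that `client` never sent anything to."""
    contacted = set()
    packets = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # prompts, blank lines, "[delay]" markers and so on
        src, dst, summary = match.groups()
        packets.append((src, dst, summary))
        if src == client:
            contacted.add(dst)
    return [
        (src, summary)
        for src, dst, summary in packets
        if dst == client and src not in contacted
    ]


if __name__ == "__main__":
    # A few lines copied from the trace in this thread.
    sample = [
        "172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP",
        "172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967",
        "172.20.12.16 -> 172.20.12.220 MOUNT3 C Null",
        "172.20.12.221 -> 172.20.12.16 MOUNT3 R Null",
        "172.20.12.220 -> 172.20.12.16 NFS R NULL3",
    ]
    for src, summary in find_asymmetric_replies(sample, "172.20.12.16"):
        print(src, summary)
```

On the sample lines this reports the two replies from 172.20.12.221 and leaves the TCP-side reply from 172.20.12.220 alone. It only shows which address answered; it says nothing about why the TCP connection on port 664 gets stuck.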