OK, it still happens even when not using aliases; it just took longer to turn up.

Attempting to mount (snoop running on NFS client)

bash-3.00# mount 172.20.12.228:/export/mail /mnt
nfs mount: 172.20.12.228: : RPC: Program not registered
nfs mount: retrying: /mnt

172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1021 Syn Seq=2365435228 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1021 S=2049 Rst Ack=2365435229 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1020 Syn Seq=2242896555 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1020 S=2049 Rst Ack=2242896556 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1019 Syn Seq=1448368793 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1019 S=2049 Rst Ack=1448368794 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1018 Syn Seq=883538524 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1018 S=2049 Rst Ack=883538525 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 NFS C FSSTAT3 FH=D702
172.20.12.228 -> 172.20.12.21 ICMP Destination unreachable (UDP port 2049 unreachable)
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1017 Syn Seq=3028937941 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1016 Syn Seq=3821439944 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1016 S=2049 Rst Ack=3821439945 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1015 Syn Seq=1966482573 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1015 S=2049 Rst Ack=1966482574 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
^C
bash-3.00# 172.20.12.21 -> 172.20.12.228 TCP D=2049 S=1014 Syn Seq=2696600158 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=1014 S=2049 Rst Ack=2696600159 Win=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0
172.20.12.21 -> 172.20.12.228 PORTMAP C GETPORT prog=100005 (MOUNT) vers=2 proto=UDP
172.20.12.228 -> 172.20.12.21 PORTMAP R GETPORT port=0

# rpcinfo 172.20.12.228
rpcinfo: can't contact rpcbind: : RPC: Unable to receive; errno = Connection refused; System error
bash-3.00# 172.20.12.21 -> 172.20.12.228 TCP D=111 S=50279 Syn Seq=3313033773 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=50279 S=111 Rst Ack=3313033774 Win=0
172.20.12.21 -> 172.20.12.228 TCP D=111 S=54373 Syn Seq=3383588494 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.228 -> 172.20.12.21 TCP D=54373 S=111 Rst Ack=3383588495 Win=0
172.20.12.21 -> 172.20.12.228 RPCBIND C DUMP
172.20.12.228 -> 172.20.12.21 ICMP Destination unreachable (UDP port 111 unreachable)

From a different nfs client:

rpcinfo 172.20.12.228
   program  version  netid     address                 service   owner
    100000     4     ticots    x4500-05.unix.rpc       rpcbind   superuser
    100000     3     ticots    x4500-05.unix.rpc       rpcbind   superuser
    100000     4     ticotsord x4500-05.unix.rpc       rpcbind   superuser
    100000     3     ticotsord x4500-05.unix.rpc       rpcbind   superuser
    100000     4     ticlts    x4500-05.unix.rpc       rpcbind   superuser
    100000     3     ticlts    x4500-05.unix.rpc       rpcbind   superuser
    100000     4     tcp       0.0.0.0.0.111           rpcbind   superuser
    100000     3     tcp       0.0.0.0.0.111           rpcbind   superuser
    100000     2     tcp       0.0.0.0.0.111           rpcbind   superuser
    100000     4     udp       0.0.0.0.0.111           rpcbind   superuser
    100000     3     udp       0.0.0.0.0.111           rpcbind   superuser
    100000     2     udp       0.0.0.0.0.111           rpcbind   superuser
    100024     1     udp       0.0.0.0.128.10          status    superuser
    100024     1     tcp       0.0.0.0.128.3           status    superuser
    100024     1     ticlts    \021\000\000\000        status    superuser
    100024     1     ticotsord \024\000\000\000        status    superuser
    100024     1     ticots    \027\000\000\000        status    superuser
    100133     1     udp       0.0.0.0.128.10          -         superuser
    100133     1     tcp       0.0.0.0.128.3           -         superuser
    100133     1     ticlts    \021\000\000\000        -         superuser
    100133     1     ticotsord \024\000\000\000        -         superuser
    100133     1     ticots    \027\000\000\000        -         superuser
    100021     1     udp       0.0.0.0.15.205          nlockmgr  1
1073741824     1     tcp       0.0.0.0.128.4           -         1
    100021     2     udp       0.0.0.0.15.205          nlockmgr  1
    100021     3     udp       0.0.0.0.15.205          nlockmgr  1
    100021     4     udp       0.0.0.0.15.205          nlockmgr  1
    100021     1     tcp       0.0.0.0.15.205          nlockmgr  1
    100021     2     tcp       0.0.0.0.15.205          nlockmgr  1
    100021     3     tcp       0.0.0.0.15.205          nlockmgr  1
    100021     4     tcp       0.0.0.0.15.205          nlockmgr  1
    100155     1     ticotsord l\000\000\000           smserverd superuser
    100011     1     ticlts    o\000\000\000           rquotad   superuser
    100011     1     udp       0.0.0.0.128.18          rquotad   superuser
    100231     1     ticlts    x4500-05.unix.nfsauth   -         superuser
    100231     1     ticotsord x4500-05.unix.nfsauth   -         superuser
    100231     1     ticots    x4500-05.unix.nfsauth   -         superuser
    100005     1     udp       0.0.0.0.128.19          mountd    superuser
    100005     1     ticlts    \203\000\000\000        mountd    superuser
    100005     1     tcp       0.0.0.0.128.13          mountd    superuser
    100005     1     ticotsord \210\000\000\000        mountd    superuser
    100005     1     ticots    \213\000\000\000        mountd    superuser
    100005     2     udp       0.0.0.0.128.19          mountd    superuser
    100005     2     ticlts    \203\000\000\000        mountd    superuser
    100005     2     tcp       0.0.0.0.128.13          mountd    superuser
    100005     2     ticotsord \210\000\000\000        mountd    superuser
    100005     2     ticots    \213\000\000\000        mountd    superuser
    100005     3     udp       0.0.0.0.128.19          mountd    superuser
    100005     3     ticlts    \203\000\000\000        mountd    superuser
    100005     3     tcp       0.0.0.0.128.13          mountd    superuser
    100005     3     ticotsord \210\000\000\000        mountd    superuser
    100005     3     ticots    \213\000\000\000        mountd    superuser
    100003     2     udp       0.0.0.0.8.1             nfs       1
    100003     3     udp       0.0.0.0.8.1             nfs       1
    100227     2     udp       0.0.0.0.8.1             nfs_acl   1
    100227     3     udp       0.0.0.0.8.1             nfs_acl   1
    100003     2     tcp       0.0.0.0.8.1             nfs       1
    100003     3     tcp       0.0.0.0.8.1             nfs       1
    100003     4     tcp       0.0.0.0.8.1             nfs       1
    100227     2     tcp       0.0.0.0.8.1             nfs_acl   1
    100227     3     tcp       0.0.0.0.8.1             nfs_acl   1

Why would the NFS client not be able to talk to the server?

Three minutes later, rpcinfo got unstuck and NFS was back again, without me
doing anything but running snoop.

Lund


Robert van Veelen wrote:
> I will try this out on my test hosts. Are you using NFSv3 exclusively? There
> are no v4 clients in your env? If the clients are exclusively ro, have you
> tried mounting with ro flag?
>
> -rob
>
>
> -----Original Message-----
> From: Jorgen Lundman [mailto:lundman at gmo.jp]
> Sent: Monday, February 16, 2009 12:11 AM Eastern Standard Time
> To: Robert van Veelen
> Subject: Re: [nfs-discuss] NFS hanging with RPC timeout.
>
>
> I have not had time to prove this, but by asking the other admins which
> NFS server mounts hung, nobody could remember x4500-01 ever hanging. The
> reason I asked them was because x4500-01 is the only one where the
> "alias" is a lower IP than the real IP.
>
> 01-alias: .220
> 01-real: .221
> 02-real: .222
> 02-alias: .223
> 03-real: .224
> 03-alias: .225
> 04-real: .226
> 04-alias: .227
>
> Now, IP "value" should not matter, I know, but it just "felt" like it
> was related. :)
>
> Lund
>
>
>
> Robert van Veelen wrote:
>> I have a handful of x4150s to play with this week. I'll drop some ipmi
>> addresses on them and see if I can reproduce the symptoms that you are
>> describing. Would be interesting to see. Always good to know where one's
>> bugs lie.
>>
>> -rob
>>
>>
>> -----Original Message-----
>> From: Jorgen Lundman [mailto:lundman at gmo.jp]
>> Sent: Sunday, February 15, 2009 11:47 PM Eastern Standard Time
>> To: Robert van Veelen
>> Subject: Re: [nfs-discuss] NFS hanging with RPC timeout.
>>
>>
>> The servers that hang the most are www and navi (apache), which is
>> nearly exclusively read-only. Servers like FTP, and vmx have yet to
>> hang at all. It sure does not make much sense here.
>>
>> Strangely enough, navi servers (3 reboots a day before) "appears" to do
>> a lot better (only one reboot in 2 days), but now we see a lot of:
>>
>> Feb 16 13:40:28 navi01.unix nfs: [ID 333984 kern.notice] NFS server
>> 172.20.12.224 not responding still trying
>> Feb 16 13:40:41 navi01.unix nfs: [ID 563706 kern.notice] NFS server
>> 172.20.12.224 ok
>> Feb 16 13:41:47 navi01.unix nfs: [ID 333984 kern.notice] NFS server
>> 172.20.12.224 not responding still trying
>> Feb 16 13:42:03 navi01.unix nfs: [ID 563706 kern.notice] NFS server
>> 172.20.12.224 ok
>> Feb 16 13:42:28 navi01.unix nfs: [ID 333984 kern.notice] NFS server
>> 172.20.12.224 not responding still trying
>>
>>
>> Even though the x4500 is just fine, and talk to nav01 just fine even
>> during one of these "stalls". Services appear unaffected.
>>
>> Oh bugger, I just noticed navi servers are not 5/08. That would possibly
>> explain why they are the worst of all. I will upgrade these servers
>> asap. (SunOS navi01.unix 5.11 snv_40 i86pc i386 i86pc)
>>
>> Lund
>>
>> Robert van Veelen wrote:
>>> Jorgen,
>>> I have been following the back and forth on the list with the ip alias
>>> info. It does seem like a strange case. It would be interesting if you
>>> found some connection to the hangs we are seeing but it also appears
>>> unlikely now.
>>> For what it's worth, the patch I specified was rolled into the 10/08
>>> release and cannot be removed trivially. This is how I was burned while
>>> deploying to our first qa hosts for 10/08 then through the backported patch
>>> to 5/08.
>>> Are you writing/reading to or from the shared nfs space directly on the
>>> server side? This seems to be a key factor in my steps to recreate our
>>> hang.
>>> Good luck,
>>>
>>> -rob
>>>
>>>
>>> -----Original Message-----
>>> From: Jorgen Lundman [mailto:lundman at gmo.jp]
>>> Sent: Sunday, February 15, 2009 07:55 PM Eastern Standard Time
>>> To: Robert van Veelen
>>> Subject: Re: [nfs-discuss] NFS hanging with RPC timeout.
>>>
>>>
>>>
>>> Hello,
>>>
>>> Sorry, I only just saw your mail now, my mail filters were not smart
>>> enough to move it to the right place :) I do not think I have 137138-09
>>> installed on the Sol 10 5/08 servers, but it appears installed on the
>>> 10/08. But most recent findings seem to indicate that the problem we are
>>> having is with IP aliases. Currently testing this hypothesis.
>>>
>>> Lund
>>>
>>>
>>>
>>> Robert van Veelen wrote:
>>>> Do you have Solaris patch 137138-09 installed? You may need to back that
>>>> out until a permanent fix is posted. The issue that you are describing
>>>> sounds exactly like a problem that I have seen on similar machines here.
>>>> In my testing the only workaround was to back out the kernel patch
>>>> 137138-09 on the clients (server can remain as is). If you have a support
>>>> contract then I would also open a case with sun as there appears to be a
>>>> regression in the kernel code. At this point I can reproduce the deadlock
>>>> within 30 seconds.
>>>> You might be able to reference my open case for this issue. I will forward
>>>> more info if you find that this is the same issue. Good luck.
>>>> Regards,
>>>>
>>>> -rob
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Jorgen Lundman [mailto:lundman at gmo.jp]
>>>> Sent: Monday, February 09, 2009 09:21 PM Eastern Standard Time
>>>> To: nfs-discuss at opensolaris.org
>>>> Subject: [nfs-discuss] NFS hanging with RPC timeout.
>>>>
>>>> (Resent due to wrong sender, sorry)
>>>>
>>>>
>>>> Hello list!
>>>>
>>>> *** NFS Servers:
>>>>
>>>> x4500-01 to x4500-05
>>>> : Solaris 10 5/08, ZFS and "UFS on ZVOL" exported.
>>>> : NFSD_SERVER=1024, LOCKD_SERVER=128 average use about 900 / 20 threads.
>>>> : "bufhwm_pct,maxusers,ndquot,ncsize,ufs_ninode,clnt_max_conns,
>>>> : rpcmod:cotsmaxdupreqs,rpcmod:maxdupreqs" tweaked in /etc/system.
>>>>
>>>> *** NFS Clients:
>>>>
>>>> Supermicro 1U * 40
>>>> : Solaris 10 5/08
>>>> : No tweaks, Mounted as
>>>> : x4500-03:/export/mail - /export/mail nfs - yes vers=3,hard,intr,quota
>>>> : x4500-02:/export/preview - /export/preview nfs - yes vers=3,hard,intr
>>>>
>>>>
>>>> *** Background
>>>>
>>>> Using vers=3 to have uid mapping, without the need for UID lookups. UFS
>>>> on ZVOL are mounted with "quota". ZFS exported filesystems are mounted
>>>> without. The system is live and generally works very well.
>>>>
>>>> However, NFS will periodically hang. Usually to just one of the x4500
>>>> servers at a time, the solution currently is just to reboot the client.
>>>> I have attempted to fully umount all filesystems, and terminate the NFS
>>>> and RPC processes, in an attempt to remount. This will not fix it. I can
>>>> not really restart the NFSD/RPC processes on the x4500s.
>>>>
>>>> Usually looks like:
>>>>
>>>> # df -h
>>>> [snip]
>>>> x4500-03:/export/preview
>>>> 23T 3.9M 23T 1% /export/preview
>>>> NFS server x4500-01 not responding still trying
>>>> ^C
>>>>
>>>> Note that during this time, x4500-01 is still functioning correctly to
>>>> the other 39 servers, and x4500-02,03,04,05 are still mounted correctly
>>>> on this NFS client.
>>>>
>>>> # umount /export/www
>>>> # mount /export/www
>>>> NFS server x4500-01-vip not responding still trying
>>>>
>>>> Truss of the mount says:
>>>> 23102: 0.0000 getpid() = 23102
>>>> [23101]
>>>> 23102: 0.0000 door_call(5, 0x080475A0) = 0
>>>> 23102: 0.0001 close(5) = 0
>>>> NFS server x4500-01-vip not responding still trying
>>>> ^C23102: 69.0780 mount("x4500-01-vip:/export/www", "/export/www",
>>>> MS_DATA|MS_OPTIONSTR, "nfs3", 0x0806D400, 76, 0x0804777C, 1024) Err#4 EINTR
>>>>
>>>> Snoop says (x4500-01 is 172.20.12.220, NFS Client is 172.20.12.16)
>>>>
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
>>>> vers=3 proto=UDP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
>>>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
>>>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
>>>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
>>>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
>>>> proto=TCP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579
>>>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580
>>>> Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale
>>>> 0,nop,nop,sackOK>
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915
>>>> Seq=2255048580 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048700
>>>> Seq=611591915 Len=0 Win=49520
>>>> 172.20.12.220 -> 172.20.12.16 NFS R NULL3
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591943
>>>> Seq=2255048700 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Fin Ack=611591943
>>>> Seq=2255048700 Len=0 Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048701
>>>> Seq=611591943 Len=0 Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Fin Ack=2255048701
>>>> Seq=611591943 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591944
>>>> Seq=2255048701 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> [delay]
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> [repeat, delay]
>>>>
>>>>
>>>> *** truss of mountd on x4500-01 while attempting mount:
>>>>
>>>> # truss -Dfip 28717
>>>> 28717: 6.8156 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
>>>> 28717: 0.0002 lwp_kill(788, SIG#0) Err#3 ESRCH
>>>> 28717: 0.0001 lwp_create(0x08047B90, LWP_DETACHED|LWP_SUSPENDED,
>>>> 0x08047DB0) = 791
>>>> 28717/1: 0.0002 lwp_continue(791) = 0
>>>> 28717/791: 6.8159 lwp_create() (returning as new lwp ...) = 0
>>>> 28717/1: 0.0001 fxstat(2, 7, 0x08047CB0) = 0
>>>> 28717/791: 0.0003 setustack(0xFECD1A60)
>>>> 28717/1: 0.0000 getmsg(7, 0x08047D8C, 0x080CC018, 0x08047DAC) = 0
>>>> 28717/791: 0.0001 schedctl()
>>>> = 0xFEFB2010
>>>> 28717/1: 0.0001 open("/dev/udp", O_RDONLY) =
>>>> 16
>>>> 28717/1: 0.0001 ioctl(16, SIOCTMYADDR, 0x08047CA8) = 0
>>>> 28717/1: 0.0001 close(16) = 0
>>>> 28717/1: 0.0000 fxstat(2, 7, 0x08047C40) = 0
>>>> 28717/1: 0.0000 putmsg(7, 0x08047D18, 0x080CC018, 0) = 0
>>>> 28717/1: 0.0001 write(14, "F0", 1) = 1
>>>> 28717/791: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
>>>> 28717/791: 0.0000 read(13, "F0", 16) = 1
>>>> 28717/791: 0.0001 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000) = 1
>>>> 28717/791: 0.0001 lwp_unpark(1) = 0
>>>> 28717/1: 0.0002 lwp_park(0x00000000, 0) = 0
>>>> 28717/791: 0.0000 fxstat(2, 7, 0xFEA3FE40) = 0
>>>> 28717/791: 0.0001 getmsg(7, 0xFEA3FF20, 0x080CC018, 0xFEA3FF40) = 0
>>>> 28717/791: 0.0001 open("/dev/udp", O_RDONLY) =
>>>> 16
>>>> 28717/791: 0.0000 ioctl(16, SIOCTMYADDR, 0xFEA3FE38) = 0
>>>> 28717/791: 0.0001 close(16) = 0
>>>> 28717/791: 0.0000 write(14, " E", 1) = 1
>>>> 28717/1: 0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000) = 1
>>>> 28717/791: 0.0001 getuid()
>>>> = 0 [0]
>>>> 28717/1: 0.0001 read(13, " E", 16) = 1
>>>> 28717/791: 0.0000 getuid()
>>>> = 0 [0]
>>>> 28717/791: 0.0001 door_info(15, 0xFEA3F360) = 0
>>>> 28717/791: 0.0001 door_call(15, 0xFEA3F3B8) = 0
>>>> 28717/791: 0.0000 resolvepath("/export/www", "/export/www", 1024) =
>>>> 18
>>>> 28717/791: 0.0001 xstat(2, "/etc/dfs/sharetab", 0xFEA3F6B8) = 0
>>>> 28717/791: 0.0001 nfssys(20, 0xFEA3F860) = 0
>>>> 28717/791: 0.0000 fxstat(2, 7, 0xFEA3F6F0) = 0
>>>> 28717/791: 0.0000 putmsg(7, 0xFEA3F7C8, 0x080CC018, 0) = 0
>>>> 28717/791: 0.0001 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7)
>>>> = 0xFFBFFEFF [0x0000FFFF]
>>>> 28717/791: 0.0000 lwp_exit()
>>>>
>>>> [pause]
>>>>
>>>>
>>>>
>>>> What IS somewhat amusing though, even though I can not mount it again
>>>> using TCP but if I change to using UDP it will mount just fine. We
>>>> changed most servers to using UDP and it seems to hang less, but it will
>>>> still eventually hang.
>>>>
>>>> # mount -o proto=udp /export/www
>>>> # df -h
>>>> x4500-01-vip:/export/www
>>>> 984G 73G 901G 8% /export/www
>>>>
>>>>
>>>> Successful mount proto=udp snoop:
>>>>
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1161480443
>>>> Len=0 Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Syn Ack=1118215538
>>>> Seq=4284552306 Len=0 Win=49640 Options=<mss 1460,nop,wscale
>>>> 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Ack=1118215538
>>>> Seq=4284552307 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
>>>> vers=3 proto=UDP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
>>>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
>>>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
>>>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
>>>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
>>>> proto=UDP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>>>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>>>> 172.20.12.221 -> 172.20.12.16 NFS R NULL3
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
>>>> proto=UDP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>>>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>>>> 172.20.12.221 -> 172.20.12.16 NFS R NULL3
>>>> 172.20.12.16 -> 172.20.12.220 NFS C FSINFO3 FH=D502
>>>> 172.20.12.221 -> 172.20.12.16 NFS R FSINFO3 OK
>>>> 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
>>>> 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
>>>> 172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
>>>> 172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
>>>>
>>>>
>>>> Attempt to re-mount using TCP again, for fun
>>>>
>>>> # umount /export/www
>>>> # mount /export/www
>>>> NFS server x4500-01-vip not responding still trying
>>>>
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
>>>> vers=3 proto=UDP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
>>>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
>>>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
>>>> 172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
>>>> 172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
>>>> 172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
>>>> proto=TCP
>>>> 172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Syn Seq=2389376336
>>>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Syn Ack=2389376337
>>>> Seq=997480070 Len=0 Win=49640 Options=<mss 1460,nop,wscale
>>>> 0,nop,nop,sackOK>
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480071
>>>> Seq=2389376337 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 NFS C NULL3
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376457
>>>> Seq=997480071 Len=0 Win=49520
>>>> 172.20.12.220 -> 172.20.12.16 NFS R NULL3
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480099
>>>> Seq=2389376457 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Fin Ack=997480099
>>>> Seq=2389376457 Len=0 Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376458
>>>> Seq=997480099 Len=0 Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Fin Ack=2389376458
>>>> Seq=997480099 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480100
>>>> Seq=2389376458 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1240043383
>>>> Len=0 Win=49640
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
>>>> Seq=99287825 Len=0 Win=49640
>>>> 172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
>>>> Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>>> 172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
>>>> Seq=99287825 Len=0 Win=49640
>>>>
>>>>
>>>> So, TCP is hung until reboot. If I reboot the NFS client it will mount
>>>> TCP just fine again. When both UDP and TCP have hung there is nothing I
>>>> can do to make it mount. We never reboot the x4500's.
>>>>
>>>> So, since of the 40 odd NFS clients, we have to reboot about 6 every day
>>>> which is getting tedious, and worse than that, we do not always notice
>>>> it is stuck immediately.
>>>>
>>>> We have put Solaris 10 10/08 on some NFS clients as well, but it is too
>>>> early to know if it fixes anything. We will most likely also try 10/08
>>>> on the x4500, but that is a much larger task.
>>>>
>>>> Are there any NFS related patches we should explore?
>>>>
>>>> Sorry for the length of this email, I wanted to include as much details
>>>> as possible and show I have tried most things in an attempt to discover
>>>> where the trouble lies.
>>>>
>>>> Other Google results hinted on running out of secure ports, but netstat
>>>> shows no indication of that as far as I can tell. No entries for the
>>>> hung NFS client on the x4500. The NFS client has a relatively small
>>>> netstat -na, with the exception of 47 entries for "stream-ord".
>>>>
>>>>
>>>> We would appreciate any feedback on this issue, thank you.
>>>>
>>>>
>>>> Lund
>>>>
>>>>
>>>> *** Random commands while mount is hung:
>>>>
>>>> # showmount -e x4500-01-vip
>>>> export list for x4500-01-vip:
>>>> /export/mail @172.20.12,@172.20.15
>>>> /export/www @172.20.12,@172.20.15
>>>> /export/dovecot @172.20.12,@172.20.15
>>>>
>>>>
>>>> # rpcinfo -m x4500-01-vip
>>>> PORTMAP (version 2) statistics
>>>> NULL    SET     UNSET   GETPORT          DUMP    CALLIT
>>>> 0       0/0     0/0     1503694/1503838  0       0/0
>>>>
>>>> PMAP_GETPORT call statistics
>>>> prog       vers  netid      success  failure
>>>> nlockmgr   4     udp        4342     0
>>>> status     1     tcp        2        0
>>>> nlockmgr   2     udp        42       0
>>>> nlockmgr   4     tcp        1433764  0
>>>> nfs        3     udp        346      0
>>>> nfs        3     tcp        400      0
>>>> status     1     udp        49       0
>>>> mountd     1     udp        79       2
>>>> mountd     1     tcp        11       2
>>>> mountd     3     udp        654      113
>>>> rquotad    1     udp        64001    23
>>>> metad      2     tcp        3        0
>>>> smserverd  1     tcp        0        1
>>>> smserverd  1     udp        0        1
>>>> 300598     1     udp        1        1
>>>> 300598     1     tcp        0        1
>>>>
>>>> RPCBIND (version 3) statistics
>>>> NULL    SET     UNSET   GETADDR  DUMP    CALLIT  TIME    U2T     T2U
>>>> 0       0/0     0/0     2/2      0       0/0     0       0       0
>>>>
>>>> RPCB_GETADDR (version 3) call statistics
>>>> prog       vers  netid      success  failure
>>>> status     1     ticotsord  1        0
>>>> 100133     1     ticotsord  1        0
>>>>
>>>> RPCBIND (version 4) statistics
>>>> NULL    SET     UNSET    GETADDR  DUMP    CALLIT  TIME    U2T     T2U
>>>> 0       99/99   115/115  1/2      0       0/0     0       0       0
>>>> VERADDR INDRECT GETLIST GETSTAT
>>>> 0       0       1       1
>>>>
>>>> RPCB_GETADDR (version 4) call statistics
>>>> prog       vers  netid      success  failure
>>>> smserverd  1     ticlts     1        1
>>>>
>>>>
>>>> # rpcinfo -T tcp x4500-01-vip 100005 3
>>>> program 100005 version 3 ready and waiting
>>>>
>>>>
>>>>
>

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
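
P.S. For anyone wading through the snoop traces in this thread: the failing mount at the top of this message boils down to two repeating symptoms, TCP SYNs that the server answers with an RST, and PORTMAP GETPORT replies that come back with port=0. Below is a rough Python sketch that counts both from saved snoop summary lines. The function name `summarise` and the regular expressions are my own assumptions, written against the line format quoted in this thread only; this is not part of snoop or any Solaris tool, and other snoop output modes would need different patterns.

```python
import re
from collections import Counter

# Assumed format: one snoop summary line per packet, as quoted in this
# thread, e.g. "A -> B TCP D=2049 S=1021 Syn Seq=... Len=0 Win=..."
TCP_LINE = re.compile(
    r"(\d+(?:\.\d+){3}) -> (\d+(?:\.\d+){3})\s+TCP D=(\d+) S=(\d+)\s+"
    r"(Syn|Rst|Fin)?\s*(.*)"
)
# A GETPORT reply with port=0 means the program is not registered.
GETPORT_ZERO = re.compile(r"PORTMAP R GETPORT port=0\b")


def summarise(lines):
    """Return (refused, unregistered) for an iterable of snoop lines.

    refused      -- Counter keyed by (server_ip, server_port): SYNs that
                    were answered by an RST from that address and port.
    unregistered -- number of GETPORT replies carrying port=0.
    """
    pending = set()       # (client, client_port, server, server_port)
    refused = Counter()
    unregistered = 0
    for line in lines:
        if GETPORT_ZERO.search(line):
            unregistered += 1
            continue
        m = TCP_LINE.search(line)
        if not m:
            continue
        src, dst, dport, sport, flag, rest = m.groups()
        dport, sport = int(dport), int(sport)
        if flag == "Syn" and not rest.startswith("Ack="):
            # Bare SYN (a SYN-ACK shows up as "Syn Ack=...").
            pending.add((src, sport, dst, dport))
        elif flag == "Rst":
            # RST travelling back along a connection we saw a SYN for.
            key = (dst, dport, src, sport)
            if key in pending:
                pending.discard(key)
                refused[(src, sport)] += 1
    return refused, unregistered
```

Feeding it a saved capture (for example the lines of a hypothetical `snoop.txt`) gives a per-port count of refused connects, which makes it easier to compare a "stuck" client against a healthy one than reading the raw trace by eye.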