You make it sound like it might still hang, even using the real IP. ;) The www, ssl, cgi and navi clusters experience the NFS problem the most, about 2-3 times a day, so I have remounted those servers to use the real IP. This should tell us whether not using the alias makes any difference.
If any (other) servers experience the NFS problem, I will run the suggested commands.

Lund

Dai Ngo wrote:
> It's good that you now have a work-around without rebooting the client
> or server.
> IP alias might, or might not, be a problem. However the real problem is
> why the hang occurs after it has been working for a while with the
> server configured with IP alias.
>
> I think the mount with the real IP worked because the client used a
> different (source) port for the new connection, 620. If you try to mount
> using the IP alias I think the client will use port 664, which is
> already hung (the original problem), and this is why the mount failed.
> The reason the client uses port 664 to do the mount is that this
> connection was already established to the server using the IP alias.
>
> You can run these commands on the server to get a little more info on
> port 664:
>
> # ps -ef |grep nfsd --> get the nfsd PID
> # pfiles nfsd_PID ---> to see all sockets nfsd are using
> # pstack nfsd_PID --> to see what the nfsd threads are doing
> # netstat -P tcp -f inet --> to see what state the TCP sockets are in
>
> -Dai
>
> Jorgen Lundman wrote:
>>
>> Ok, a server was already hung when I got to work today.
>>
>>
>> **********************************
>>
>> x4500-04: NFS Server, Sol 10 5/08
>> Server IP (real) 172.20.12.226 netmask ffffff00
>> NFS IP (alias)   172.20.12.227 netmask ffffff00
>>
>> x4500-04:~# netstat -in ; netstat -rn
>> Name    Mtu  Net/Dest    Address       Ipkts      Ierrs Opkts      Oerrs Collis Queue
>> lo0     8232 127.0.0.0   127.0.0.1     1411       0     1411       0     0      0
>> e1000g0 1500 172.20.12.0 172.20.12.226 2762497849 0     1789082372 0     0      0
>> e1000g1 1500 172.20.19.0 172.20.19.226 96059758   0     52485074   0     0      0
>>
>>
>> Routing Table: IPv4
>> Destination          Gateway              Flags Ref   Use        Interface
>> -------------------- -------------------- ----- ----- ---------- ---------
>> default              172.20.12.1          UG    1     20456
>> 172.20.12.0          172.20.12.226        U     1     45968      e1000g0
>> 172.20.12.0          172.20.12.227        U     1     0          e1000g0:1
>> 172.20.19.0          172.20.19.226        U     1     1662       e1000g1
>> 224.0.0.0            172.20.12.226        U     1     0          e1000g0
>> 127.0.0.1            127.0.0.1            UH    5     316        lo0
>>
>>
>> **********************************
>>
>> NFS client: Sol 10 5/08
>> Client IP 172.20.12.6 netmask ffffff00
>>
>> # netstat -in ; netstat -rn
>> Name    Mtu  Net/Dest    Address     Ipkts    Ierrs Opkts    Oerrs Collis Queue
>> lo0     8232 127.0.0.0   127.0.0.1   2175     0     2175     0     0      0
>> e1000g0 1500 172.20.12.0 172.20.12.6 43315618 0     41987515 0     0      0
>> e1000g1 1500 172.20.11.0 172.20.11.6 19673254 0     13928826 0     0      0
>>
>>
>> Routing Table: IPv4
>> Destination          Gateway              Flags Ref   Use        Interface
>> -------------------- -------------------- ----- ----- ---------- ---------
>> default              172.20.11.4          UG    1     52386
>> 10.0.0.0             172.20.12.1          UG    1     0
>> 172.16.0.0           172.20.12.1          UG    1     193
>> 172.20.11.0          172.20.11.6          U     1     2406       e1000g1
>> 172.20.12.0          172.20.12.6          U     1     3163       e1000g0
>> 192.168.0.0          172.20.12.1          UG    1     120
>> 224.0.0.0            172.20.12.6          U     1     0          e1000g0
>> 127.0.0.1            127.0.0.1            UH    4     2046       lo0
>>
>>
>>
>> *********************************
>>
>>
>>
>> Snoop running on NFS Client 172.20.12.6 attempting to (re)mount volume
>> with TCP:
>>
>> # snoop -r host 172.20.12.227 or host 172.20.12.226 &
>> # mount /export/www
>>
>> 172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
>> 172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=39049
>> 172.20.12.6 -> 172.20.12.227 MOUNT3 C Null
>> 172.20.12.226 -> 172.20.12.6 MOUNT3 R Null
>> 172.20.12.6 -> 172.20.12.227 MOUNT3 C Mount /export/www
>> 172.20.12.226 -> 172.20.12.6 MOUNT3 R Mount OK FH=D402 Auth=unix
>> 172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
>> 172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=2049
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Syn Seq=788700586 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Syn Ack=788700587 Seq=3596066619 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066620 Seq=788700587 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 NFS C NULL3
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Ack=788700707 Seq=3596066620 Len=0 Win=49520
>> 172.20.12.227 -> 172.20.12.6 NFS R NULL3
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066648 Seq=788700707 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Fin Ack=3596066648 Seq=788700707 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Ack=788700708 Seq=3596066648 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Fin Ack=788700708 Seq=3596066648 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066649 Seq=788700708 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>
>>
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>
>>
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>
>> Interesting, looks like x4500-04 is replying with the wrong IP.
>>
>>
>>
>> Packet capture on x4500-04:
>>
>> # snoop -r host 172.20.12.6
>> Using device /dev/e1000g0 (promiscuous mode)
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Rst Ack=0 Seq=2924968134 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6 TCP D=664 S=2049 Rst Win=49640
>> 172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
>> 172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=39049
>> 172.20.12.6 -> 172.20.12.227 MOUNT3 C Null
>> 172.20.12.226 -> 172.20.12.6 MOUNT3 R Null
>> 172.20.12.6 -> 172.20.12.227 MOUNT3 C Mount /export/www
>> 172.20.12.226 -> 172.20.12.6 MOUNT3 R Mount OK FH=D402 Auth=unix
>> 172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
>> 172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=2049
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Syn Seq=788700586 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Syn Ack=788700587 Seq=3596066619 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066620 Seq=788700587 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 NFS C NULL3
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Ack=788700707 Seq=3596066620 Len=0 Win=49520
>> 172.20.12.227 -> 172.20.12.6 NFS R NULL3
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066648 Seq=788700707 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Fin Ack=3596066648 Seq=788700707 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Ack=788700708 Seq=3596066648 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Fin Ack=788700708 Seq=3596066648 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066649 Seq=788700708 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6 TCP D=664 S=2049 Ack=2876021783 Seq=3544124023 Len=0 Win=49640
>>
>>
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6 TCP D=664 S=2049 Ack=2876021783 Seq=3544124023 Len=0 Win=49640
>>
>>
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6 TCP D=664 S=2049 Ack=2876021783 Seq=3544124023 Len=0 Win=49640
>>
>>
>>
>> *** Attempting mount using the real IP instead of the alias:
>>
>>
>> # mount -o vers=3,hard,intr,quota 172.20.12.226:/export/www /export/www
>> ssl01:/# 172.20.12.6 -> 172.20.12.226 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
>> 172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=39049
>> 172.20.12.6 -> 172.20.12.226 MOUNT3 C Null
>> 172.20.12.226 -> 172.20.12.6 MOUNT3 R Null
>> 172.20.12.6 -> 172.20.12.226 MOUNT3 C Mount /export/www
>> 172.20.12.226 -> 172.20.12.6 MOUNT3 R Mount OK FH=D402 Auth=unix
>> 172.20.12.6 -> 172.20.12.226 PORTMAP C GETPORT prog=100003 (NFS) vers=3 proto=TCP
>> 172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=2049
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Syn Seq=88322761 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.226 -> 172.20.12.6 TCP D=63802 S=2049 Syn Ack=88322762 Seq=3700270536 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Ack=3700270537 Seq=88322762 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.226 NFS C NULL3
>> 172.20.12.226 -> 172.20.12.6 TCP D=63802 S=2049 Ack=88322882 Seq=3700270537 Len=0 Win=49520
>> 172.20.12.226 -> 172.20.12.6 NFS R NULL3
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Ack=3700270565 Seq=88322882 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Fin Ack=3700270565 Seq=88322882 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6 TCP D=63802 S=2049 Ack=88322883 Seq=3700270565 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6 TCP D=63802 S=2049 Fin Ack=88322883 Seq=3700270565 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Ack=3700270566 Seq=88322883 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Rst Ack=0 Seq=3056789346 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Syn Seq=1932893789 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.226 -> 172.20.12.6 TCP D=620 S=2049 Syn Ack=1932893790 Seq=3700480396 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Ack=3700480397 Seq=1932893790 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.226 NFS C FSINFO3 FH=D402
>> 172.20.12.226 -> 172.20.12.6 TCP D=620 S=2049 Ack=1932893946 Seq=3700480397 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6 NFS R FSINFO3 OK
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Ack=3700480565 Seq=1932893946 Len=0 Win=49640
>> 172.20.12.6 -> 172.20.12.226 NFS C FSSTAT3 FH=D402
>> 172.20.12.226 -> 172.20.12.6 TCP D=620 S=2049 Ack=1932894102 Seq=3700480565 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6 NFS R FSSTAT3 OK
>> 172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Ack=3700480737 Seq=1932894102 Len=0 Win=49640
>>
>> Which works without issue. So it is not an NFS problem, it seems to be
>> related to alias IPs.
>>
>> Do you know a way around this? Or perhaps you can suggest a place
>> where I can go to ask. As a quick solution we will just forgo the
>> Alias IP and mount directly on the "real" IP. Why can I change
>> protocol (TCP->UDP and vv) to get around it, why can I reboot the NFS
>> client as well. Did we create the aliases wrong?
>>
>> I apologise for the noise in NFS discussion list.
>>
>> Lund
>>
>>
>>
>> Dai Ngo wrote:
>>> The problem seems to be on the TCP connection between the client and
>>> the nfsd on the server. The portmap and mount requests used UDP and
>>> they went OK.
>>>
>>> There are a number of TCP RST packets sent from both the client and
>>> server; this indicates there might be a problem with lost packets
>>> causing both sides to be out of sync.
>>>
>>> Looks like the server has 2 NICs on the same subnet, 172.20.12.221
>>> and 172.20.12.220.
>>> Have you tried disabling 172.20.12.220 and just using 172.20.12.221
>>> to see if it helps?
>>> What does the output of 'netstat -in' and 'netstat -rn' on the server
>>> and the client look like?
>>>
>>> By the way, where were the packets captured from? On the server or
>>> the client? It's more useful if you can capture the packets on both
>>> sides and attach the raw capture files so they can be compared and
>>> examined in more detail.
>>
>
>

--
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
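
For comparing captures like the ones quoted above, a small script can flag RPC replies that come back from a different address than the one the request was sent to, which is the pattern visible in the PORTMAP and MOUNT3 exchanges (request to 172.20.12.227, reply from 172.20.12.226). The following is only a minimal sketch in Python, assuming `snoop -r` summary lines that have been rejoined to one packet per line. The function name `find_mismatched_replies` and the simple "pair each reply with the most recent call" logic are illustrative assumptions, not part of snoop or any NFS tooling.

```python
import re

# Start of a `snoop -r` summary line: "SRC -> DST REST"
LINE_RE = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\s+->\s+(\d+\.\d+\.\d+\.\d+)\s+(.*)$"
)

# A few lines taken from the client-side capture quoted in this thread.
SAMPLE = """\
172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100005 (MOUNT) vers=3 proto=UDP
172.20.12.226 -> 172.20.12.6 PORTMAP R GETPORT port=39049
172.20.12.6 -> 172.20.12.227 NFS C NULL3
172.20.12.227 -> 172.20.12.6 TCP D=63800 S=2049 Ack=788700707 Seq=3596066620 Len=0 Win=49520
172.20.12.227 -> 172.20.12.6 NFS R NULL3
"""


def find_mismatched_replies(lines, client):
    """Return (call_line, reply_line) pairs where the RPC reply came from
    an address other than the one the client sent the call to.

    Assumption: a reply belongs to the most recent call from the client.
    That is good enough for a short, sequential mount attempt, but not
    for a busy capture with interleaved requests.
    """
    mismatches = []
    pending = None  # (destination address, text) of the last call seen
    for raw in lines:
        match = LINE_RE.match(raw)
        if not match:
            continue
        src, dst, rest = match.groups()
        tokens = rest.split()
        # snoop marks RPC calls with "C" and replies with "R" right after
        # the protocol name; plain TCP segments have neither and are skipped.
        kind = tokens[1] if len(tokens) > 1 else ""
        if src == client and kind == "C":
            pending = (dst, raw.strip())
        elif dst == client and kind == "R" and pending is not None:
            if src != pending[0]:
                mismatches.append((pending[1], raw.strip()))
            pending = None
    return mismatches


if __name__ == "__main__":
    for call, reply in find_mismatched_replies(SAMPLE.splitlines(), "172.20.12.6"):
        print("call :", call)
        print("reply:", reply)
```

A mismatch on the UDP exchanges does not by itself explain the TCP hang on port 664; the script only makes it quicker to see where replies leave from the real IP rather than the alias.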