You make it sound like it might still hang, even using the real IP. ;)

The www, ssl, cgi and navi clusters experience the NFS problem the most, 
about 2-3 times a day, so I have remounted those servers to use the real 
IP. This should tell us whether avoiding the alias makes any difference.

If any (other) servers experience the NFS problem, I will run the 
suggested commands.
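
In the trace quoted below, the PORTMAP and MOUNT3 replies come back from
172.20.12.226 although the calls were addressed to 172.20.12.227. To check
other captures for the same pattern, here is a minimal sketch in Python. It
assumes the one-line `snoop -r` summary format seen in this thread
(`src -> dst  PROTO C|R ...`), with each packet on a single line as snoop
prints it, before any mail wrapping or `>>` quoting. The function name
`find_mismatched_replies` and the call/reply pairing rule are illustrative
assumptions of mine, not something snoop itself provides.

```python
import re

# Matches the "src -> dst  rest" prefix of a snoop -r summary line.
LINE_RE = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\s+->\s+(\d+\.\d+\.\d+\.\d+)\s+(.*)$"
)


def find_mismatched_replies(lines):
    """Return (call_dst, reply_src, reply_text) tuples for RPC replies
    whose source address differs from the address the most recent call
    of the same protocol was sent to."""
    mismatches = []
    last_call_dst = {}  # (client address, protocol) -> address the call targeted
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # wrapped continuation lines, blanks, prompts
        src, dst, rest = m.groups()
        fields = rest.split()
        if len(fields) < 2:
            continue
        proto, kind = fields[0], fields[1]
        if kind == "C":    # RPC call, e.g. "PORTMAP C GETPORT ..."
            last_call_dst[(src, proto)] = dst
        elif kind == "R":  # RPC reply, e.g. "PORTMAP R GETPORT ..."
            expected = last_call_dst.get((dst, proto))
            if expected is not None and expected != src:
                mismatches.append((expected, src, rest))
    return mismatches
```

Plain TCP summary lines (Syn, Ack, Rst) are skipped because their second
field is not `C` or `R`, so this only looks at the RPC-level calls and
replies. Feeding it the lines of a saved text capture should list every
reply that came back from an address other than the one that was called.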

Lund


Dai Ngo wrote:
> It's good that you now have a work-around without rebooting the client
> or server. The IP alias might, or might not, be a problem. However, the
> real problem is why the hang occurs after it has been working for a
> while with the server configured with an IP alias.
> 
> I think the mount with the real IP worked because the client used a
> different (source) port, 620, for the new connection. If you try to
> mount using the IP alias, I think the client will use port 664, which
> is already hung (the original problem), and this is why the mount
> failed. The client uses port 664 for the mount because that connection
> was already established to the server using the IP alias.
> 
> You can run these commands on the server to get a little more info on
> port 664:
> 
> # ps -ef | grep nfsd          --> get the nfsd PID
> # pfiles nfsd_PID             --> see all sockets nfsd is using
> # pstack nfsd_PID             --> see what the nfsd threads are doing
> # netstat -P tcp -f inet      --> see what state the TCP sockets are in
> 
> -Dai
> 
> Jorgen Lundman wrote:
>>
>> Ok, a server was already hung when I got to work today.
>>
>>
>> **********************************
>>
>> x4500-04: NFS Server, Sol 10 5/08
>>   Server IP (real) 172.20.12.226 netmask ffffff00
>>   NFS IP   (alias) 172.20.12.227 netmask ffffff00
>>
>> x4500-04:~# netstat -in ; netstat -rn
>> Name    Mtu  Net/Dest      Address        Ipkts      Ierrs Opkts      Oerrs Collis Queue
>> lo0     8232 127.0.0.0     127.0.0.1      1411       0     1411       0     0      0
>> e1000g0 1500 172.20.12.0   172.20.12.226  2762497849 0     1789082372 0     0      0
>> e1000g1 1500 172.20.19.0   172.20.19.226  96059758   0     52485074   0     0      0
>>
>>
>> Routing Table: IPv4
>>   Destination           Gateway           Flags  Ref     Use     Interface
>> -------------------- -------------------- ----- ----- ---------- ---------
>> default              172.20.12.1          UG        1      20456
>> 172.20.12.0          172.20.12.226        U         1      45968 e1000g0
>> 172.20.12.0          172.20.12.227        U         1          0 e1000g0:1
>> 172.20.19.0          172.20.19.226        U         1       1662 e1000g1
>> 224.0.0.0            172.20.12.226        U         1          0 e1000g0
>> 127.0.0.1            127.0.0.1            UH        5        316 lo0
>>
>>
>> **********************************
>>
>> NFS client: Sol 10 5/08
>>   Client IP        172.20.12.6 netmask ffffff00
>>
>> # netstat -in ; netstat -rn
>> Name    Mtu  Net/Dest      Address        Ipkts    Ierrs Opkts    Oerrs Collis Queue
>> lo0     8232 127.0.0.0     127.0.0.1      2175     0     2175     0     0      0
>> e1000g0 1500 172.20.12.0   172.20.12.6    43315618 0     41987515 0     0      0
>> e1000g1 1500 172.20.11.0   172.20.11.6    19673254 0     13928826 0     0      0
>>
>>
>> Routing Table: IPv4
>>   Destination           Gateway           Flags  Ref     Use     Interface
>> -------------------- -------------------- ----- ----- ---------- ---------
>> default              172.20.11.4          UG        1      52386
>> 10.0.0.0             172.20.12.1          UG        1          0
>> 172.16.0.0           172.20.12.1          UG        1        193
>> 172.20.11.0          172.20.11.6          U         1       2406 e1000g1
>> 172.20.12.0          172.20.12.6          U         1       3163 e1000g0
>> 192.168.0.0          172.20.12.1          UG        1        120
>> 224.0.0.0            172.20.12.6          U         1          0 e1000g0
>> 127.0.0.1            127.0.0.1            UH        4       2046 lo0
>>
>>
>>
>> *********************************
>>
>>
>>
>> Snoop running on NFS Client 172.20.12.6 attempting to (re)mount volume 
>> with TCP:
>>
>> # snoop -r host 172.20.12.227 or host 172.20.12.226 &
>> # mount /export/www
>>  172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100005 (MOUNT) 
>> vers=3 proto=UDP
>> 172.20.12.226 -> 172.20.12.6  PORTMAP R GETPORT port=39049
>>  172.20.12.6 -> 172.20.12.227 MOUNT3 C Null
>> 172.20.12.226 -> 172.20.12.6  MOUNT3 R Null
>>  172.20.12.6 -> 172.20.12.227 MOUNT3 C Mount /export/www
>> 172.20.12.226 -> 172.20.12.6  MOUNT3 R Mount OK FH=D402 Auth=unix
>>  172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100003 (NFS) 
>> vers=3 proto=TCP
>> 172.20.12.226 -> 172.20.12.6  PORTMAP R GETPORT port=2049
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Syn Seq=788700586 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Syn Ack=788700587 
>> Seq=3596066619 Len=0 Win=49640 Options=<mss 1460,nop,wscale 
>> 0,nop,nop,sackOK>
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066620 
>> Seq=788700587 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 NFS C NULL3
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Ack=788700707 
>> Seq=3596066620 Len=0 Win=49520
>> 172.20.12.227 -> 172.20.12.6  NFS R NULL3
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066648 
>> Seq=788700707 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Fin Ack=3596066648 
>> Seq=788700707 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Ack=788700708 
>> Seq=3596066648 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Fin Ack=788700708 
>> Seq=3596066648 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066649 
>> Seq=788700708 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>
>>
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>
>>
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>>
>> Interesting, looks like x4500-04 is replying with the wrong IP.
>>
>>
>>
>> Packet capture on x4500-04:
>>
>> # snoop -r host 172.20.12.6
>> Using device /dev/e1000g0 (promiscuous mode)
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Rst Ack=0 
>> Seq=2924968134 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6  TCP D=664 S=2049 Rst Win=49640
>>  172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100005 (MOUNT) 
>> vers=3 proto=UDP
>> 172.20.12.226 -> 172.20.12.6  PORTMAP R GETPORT port=39049
>>  172.20.12.6 -> 172.20.12.227 MOUNT3 C Null
>> 172.20.12.226 -> 172.20.12.6  MOUNT3 R Null
>>  172.20.12.6 -> 172.20.12.227 MOUNT3 C Mount /export/www
>> 172.20.12.226 -> 172.20.12.6  MOUNT3 R Mount OK FH=D402 Auth=unix
>>  172.20.12.6 -> 172.20.12.227 PORTMAP C GETPORT prog=100003 (NFS) 
>> vers=3 proto=TCP
>> 172.20.12.226 -> 172.20.12.6  PORTMAP R GETPORT port=2049
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Syn Seq=788700586 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Syn Ack=788700587 
>> Seq=3596066619 Len=0 Win=49640 Options=<mss 1460,nop,wscale 
>> 0,nop,nop,sackOK>
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066620 
>> Seq=788700587 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 NFS C NULL3
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Ack=788700707 
>> Seq=3596066620 Len=0 Win=49520
>> 172.20.12.227 -> 172.20.12.6  NFS R NULL3
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066648 
>> Seq=788700707 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Fin Ack=3596066648 
>> Seq=788700707 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Ack=788700708 
>> Seq=3596066648 Len=0 Win=49640
>> 172.20.12.227 -> 172.20.12.6  TCP D=63800 S=2049 Fin Ack=788700708 
>> Seq=3596066648 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=63800 Ack=3596066649 
>> Seq=788700708 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6  TCP D=664 S=2049 Ack=2876021783 
>> Seq=3544124023 Len=0 Win=49640
>>
>>
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6  TCP D=664 S=2049 Ack=2876021783 
>> Seq=3544124023 Len=0 Win=49640
>>
>>
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Syn Seq=2946510831 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.227 -> 172.20.12.6  TCP D=664 S=2049 Ack=2876021783 
>> Seq=3544124023 Len=0 Win=49640
>>
>>
>>
>> *** Attempting mount using the real IP instead of the alias:
>>
>>
>> # mount -o vers=3,hard,intr,quota 172.20.12.226:/export/www /export/www
>> ssl01:/#  172.20.12.6 -> 172.20.12.226 PORTMAP C GETPORT prog=100005 
>> (MOUNT) vers=3 proto=UDP
>> 172.20.12.226 -> 172.20.12.6  PORTMAP R GETPORT port=39049
>>  172.20.12.6 -> 172.20.12.226 MOUNT3 C Null
>> 172.20.12.226 -> 172.20.12.6  MOUNT3 R Null
>>  172.20.12.6 -> 172.20.12.226 MOUNT3 C Mount /export/www
>> 172.20.12.226 -> 172.20.12.6  MOUNT3 R Mount OK FH=D402 Auth=unix
>>  172.20.12.6 -> 172.20.12.226 PORTMAP C GETPORT prog=100003 (NFS) 
>> vers=3 proto=TCP
>> 172.20.12.226 -> 172.20.12.6  PORTMAP R GETPORT port=2049
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Syn Seq=88322761 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.226 -> 172.20.12.6  TCP D=63802 S=2049 Syn Ack=88322762 
>> Seq=3700270536 Len=0 Win=49640 Options=<mss 1460,nop,wscale 
>> 0,nop,nop,sackOK>
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Ack=3700270537 
>> Seq=88322762 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.226 NFS C NULL3
>> 172.20.12.226 -> 172.20.12.6  TCP D=63802 S=2049 Ack=88322882 
>> Seq=3700270537 Len=0 Win=49520
>> 172.20.12.226 -> 172.20.12.6  NFS R NULL3
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Ack=3700270565 
>> Seq=88322882 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Fin Ack=3700270565 
>> Seq=88322882 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6  TCP D=63802 S=2049 Ack=88322883 
>> Seq=3700270565 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6  TCP D=63802 S=2049 Fin Ack=88322883 
>> Seq=3700270565 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=63802 Ack=3700270566 
>> Seq=88322883 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.227 TCP D=2049 S=664 Rst Ack=0 
>> Seq=3056789346 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Syn Seq=1932893789 
>> Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
>> 172.20.12.226 -> 172.20.12.6  TCP D=620 S=2049 Syn Ack=1932893790 
>> Seq=3700480396 Len=0 Win=49640 Options=<mss 1460,nop,wscale 
>> 0,nop,nop,sackOK>
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Ack=3700480397 
>> Seq=1932893790 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.226 NFS C FSINFO3 FH=D402
>> 172.20.12.226 -> 172.20.12.6  TCP D=620 S=2049 Ack=1932893946 
>> Seq=3700480397 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6  NFS R FSINFO3 OK
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Ack=3700480565 
>> Seq=1932893946 Len=0 Win=49640
>>  172.20.12.6 -> 172.20.12.226 NFS C FSSTAT3 FH=D402
>> 172.20.12.226 -> 172.20.12.6  TCP D=620 S=2049 Ack=1932894102 
>> Seq=3700480565 Len=0 Win=49640
>> 172.20.12.226 -> 172.20.12.6  NFS R FSSTAT3 OK
>>  172.20.12.6 -> 172.20.12.226 TCP D=2049 S=620 Ack=3700480737 
>> Seq=1932894102 Len=0 Win=49640
>>
>> Which works without issue. So it is not an NFS problem, it seems to be 
>> related to alias IPs.
>>
>> Do you know a way around this? Or perhaps you can suggest a place
>> where I can go to ask. As a quick solution we will just forgo the
>> alias IP and mount directly on the "real" IP. Why does changing the
>> protocol (TCP->UDP and vice versa) get around it, and why does
>> rebooting the NFS client work as well? Did we create the aliases wrong?
>>
>> I apologise for the noise on the NFS discussion list.
>>
>> Lund
>>
>>
>>
>> Dai Ngo wrote:
>>> The problem seems to be on the TCP connection between the client and
>>> the nfsd on the server. The portmap and mount requests used UDP and
>>> they went OK.
>>>
>>> There are a number of TCP RST packets sent from both the client and
>>> the server, which indicates there might be a problem with lost
>>> packets causing both sides to be out of sync.
>>>
>>> Looks like the server has 2 NICs on the same subnet, 172.20.12.221
>>> and 172.20.12.220. Have you tried disabling 172.20.12.220 and just
>>> using 172.20.12.221 to see if it helps? What does the output of
>>> 'netstat -in' and 'netstat -rn' on the server and the client look
>>> like?
>>>
>>> By the way, where were the packets captured from? On the server or
>>> the client? It's more useful if you can capture the packets on both
>>> sides and attach the raw capture files so they can be compared and
>>> examined in more detail.
>>
> 
> 

-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
