(Resent due to wrong sender, sorry)

Hello list!

*** NFS Servers:

x4500-01 to x4500-05
: Solaris 10 5/08, ZFS and "UFS on ZVOL" exported.
: NFSD_SERVERS=1024, LOCKD_SERVERS=128; average use is about 900 / 20 threads.
: "bufhwm_pct,maxusers,ndquot,ncsize,ufs_ninode,clnt_max_conns,
: rpcmod:cotsmaxdupreqs,rpcmod:maxdupreqs" tweaked in /etc/system.

*** NFS Clients:

Supermicro 1U * 40
: Solaris 10 5/08
: No tweaks. Mounted as:
: x4500-03:/export/mail - /export/mail nfs - yes vers=3,hard,intr,quota
: x4500-02:/export/preview - /export/preview nfs - yes vers=3,hard,intr


*** Background

We use vers=3 so that numeric UIDs map directly, without the need for
UID lookups. The UFS-on-ZVOL filesystems are mounted with "quota"; the
ZFS exported filesystems are mounted without it. The system is live and
generally works very well.

However, NFS will periodically hang, usually to just one of the x4500
servers at a time. The only fix we have at the moment is to reboot the
client. I have tried fully unmounting all filesystems and terminating
the NFS and RPC processes before remounting, but that does not fix it.
I cannot really restart the NFSD/RPC processes on the x4500s.

Usually looks like:

# df -h
[snip]
x4500-03:/export/preview
                         23T   3.9M    23T     1%    /export/preview
NFS server x4500-01 not responding still trying
^C

Note that during this time, x4500-01 is still functioning correctly to
the other 39 servers, and x4500-02,03,04,05 are still mounted correctly
on this NFS client.

# umount /export/www
# mount /export/www
NFS server x4500-01-vip not responding still trying

Truss of the mount says:
23102:   0.0000 getpid()                                        = 23102 [23101]
23102:   0.0000 door_call(5, 0x080475A0)                        = 0
23102:   0.0001 close(5)                                        = 0
NFS server x4500-01-vip not responding still trying
^C23102:        69.0780 mount("x4500-01-vip:/export/www", "/export/www",
MS_DATA|MS_OPTIONSTR, "nfs3", 0x0806D400, 76, 0x0804777C, 1024) Err#4 EINTR

Snoop says (x4500-01 is 172.20.12.220, the NFS client is 172.20.12.16):

172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
vers=3 proto=UDP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
proto=TCP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Syn Seq=2255048579
Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Syn Ack=2255048580
Seq=611591914 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591915
Seq=2255048580 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 NFS C NULL3
172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048700
Seq=611591915 Len=0 Win=49520
172.20.12.220 -> 172.20.12.16 NFS R NULL3
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591943
Seq=2255048700 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Fin Ack=611591943
Seq=2255048700 Len=0 Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Ack=2255048701
Seq=611591943 Len=0 Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=54091 S=2049 Fin Ack=2255048701
Seq=611591943 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54091 Ack=611591944
Seq=2255048701 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
Seq=4284552307 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
Seq=4284552307 Len=0 Win=49640
[delay]
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
Seq=4284552307 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
Seq=4284552307 Len=0 Win=49640
[repeat, delay]
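
In the trace above, every SYN from client port 664 (Seq=1161480442) is
answered with a bare ACK carrying Ack=1118215538 rather than a SYN/ACK
for Seq+1. One possible reading is that the server still holds state for
an older connection on that port pair, but that is a guess, not
something the trace proves. To pick this pattern out of a longer
capture, here is a minimal Python sketch. It assumes snoop's one-line
summary format as pasted here, with long records wrapped onto a second
line by the mail client; the function names are made up for
illustration.

```python
import re

# One-line snoop TCP summary: "SRC -> DST TCP D=<dport> S=<sport> <rest>"
PKT = re.compile(
    r"(?P<src>\S+) -> (?P<dst>\S+) TCP D=(?P<dport>\d+) S=(?P<sport>\d+)"
    r"(?P<rest>.*)"
)


def join_wrapped(lines):
    """Rejoin records that were wrapped onto a continuation line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if " -> " in line or not records:
            records.append(line)          # start of a new packet record
        else:
            records[-1] += " " + line     # continuation of the previous one
    return records


def stale_syn_replies(lines):
    """Return (syn_seq, reply_ack) pairs where the first reply to a SYN
    is a bare ACK that does not acknowledge syn_seq + 1."""
    hits = []
    pending = None  # (src, sport, dst, dport, seq) of the last SYN seen
    for rec in join_wrapped(lines):
        m = PKT.match(rec)
        if not m:
            continue                      # not a TCP summary line
        rest = m["rest"]
        flags = set(re.findall(r"\b(Syn|Fin|Rst)\b", rest))
        seq = re.search(r"Seq=(\d+)", rest)
        ack = re.search(r"Ack=(\d+)", rest)
        if "Syn" in flags and ack is None and seq:
            pending = (m["src"], m["sport"], m["dst"], m["dport"],
                       int(seq.group(1)))
            continue
        if pending is None:
            continue
        src, sport, dst, dport, syn_seq = pending
        is_reply = (m["src"], m["sport"], m["dst"], m["dport"]) == \
                   (dst, dport, src, sport)
        if is_reply:
            expected = (syn_seq + 1) % 2**32
            if not flags and ack and int(ack.group(1)) != expected:
                hits.append((syn_seq, int(ack.group(1))))
            pending = None                # only the first reply is examined
    return hits
```

Fed the lines of the trace above, this should report the port 664
exchanges and skip the normal handshake on port 54091.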


*** truss of mountd on x4500-01 while attempting mount:

# truss -Dfip 28717
28717:   6.8156 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000)  = 1
28717:   0.0002 lwp_kill(788, SIG#0)                            Err#3 ESRCH
28717:   0.0001 lwp_create(0x08047B90, LWP_DETACHED|LWP_SUSPENDED,
0x08047DB0) = 791
28717/1:         0.0002 lwp_continue(791)                               = 0
28717/791:       6.8159 lwp_create()    (returning as new lwp ...)      = 0
28717/1:         0.0001 fxstat(2, 7, 0x08047CB0)                        = 0
28717/791:       0.0003 setustack(0xFECD1A60)
28717/1:         0.0000 getmsg(7, 0x08047D8C, 0x080CC018, 0x08047DAC)   = 0
28717/791:       0.0001 schedctl()                                      = 0xFEFB2010
28717/1:         0.0001 open("/dev/udp", O_RDONLY)                      = 16
28717/1:         0.0001 ioctl(16, SIOCTMYADDR, 0x08047CA8)              = 0
28717/1:         0.0001 close(16)                                       = 0
28717/1:         0.0000 fxstat(2, 7, 0x08047C40)                        = 0
28717/1:         0.0000 putmsg(7, 0x08047D18, 0x080CC018, 0)            = 0
28717/1:         0.0001 write(14, "F0", 1)                              = 1
28717/791:       0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000)  = 1
28717/791:       0.0000 read(13, "F0", 16)                              = 1
28717/791:       0.0001 pollsys(0x080CAE38, 9, 0x00000000, 0x00000000)  = 1
28717/791:       0.0001 lwp_unpark(1)                                   = 0
28717/1:         0.0002 lwp_park(0x00000000, 0)                         = 0
28717/791:       0.0000 fxstat(2, 7, 0xFEA3FE40)                        = 0
28717/791:       0.0001 getmsg(7, 0xFEA3FF20, 0x080CC018, 0xFEA3FF40)   = 0
28717/791:       0.0001 open("/dev/udp", O_RDONLY)                      = 16
28717/791:       0.0000 ioctl(16, SIOCTMYADDR, 0xFEA3FE38)              = 0
28717/791:       0.0001 close(16)                                       = 0
28717/791:       0.0000 write(14, " E", 1)                              = 1
28717/1:         0.0003 pollsys(0x080CAE38, 8, 0x00000000, 0x00000000)  = 1
28717/791:       0.0001 getuid()                                        = 0 [0]
28717/1:         0.0001 read(13, " E", 16)                              = 1
28717/791:       0.0000 getuid()                                        = 0 [0]
28717/791:       0.0001 door_info(15, 0xFEA3F360)                       = 0
28717/791:       0.0001 door_call(15, 0xFEA3F3B8)                       = 0
28717/791:       0.0000 resolvepath("/export/www", "/export/www", 1024) = 18
28717/791:       0.0001 xstat(2, "/etc/dfs/sharetab", 0xFEA3F6B8)       = 0
28717/791:       0.0001 nfssys(20, 0xFEA3F860)                          = 0
28717/791:       0.0000 fxstat(2, 7, 0xFEA3F6F0)                        = 0
28717/791:       0.0000 putmsg(7, 0xFEA3F7C8, 0x080CC018, 0)            = 0
28717/791:       0.0001 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7)
= 0xFFBFFEFF [0x0000FFFF]
28717/791:       0.0000 lwp_exit()

[pause]



What IS somewhat amusing, though: even though I cannot mount it again
using TCP, if I switch to UDP it mounts just fine. We changed most of
the client machines to UDP and they seem to hang less often, but they
will still hang eventually.

# mount -o proto=udp /export/www
# df -h
x4500-01-vip:/export/www
                        984G    73G   901G     8%    /export/www


Successful mount proto=udp snoop:

172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
Seq=4284552307 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1161480442 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1118215538
Seq=4284552307 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1161480443
Len=0 Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Syn Ack=1118215538
Seq=4284552306 Len=0 Win=49640 Options=<mss 1460,nop,wscale
0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Ack=1118215538
Seq=4284552307 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
vers=3 proto=UDP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
proto=UDP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
172.20.12.16 -> 172.20.12.220 NFS C NULL3
172.20.12.221 -> 172.20.12.16 NFS R NULL3
172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
proto=UDP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
172.20.12.16 -> 172.20.12.220 NFS C NULL3
172.20.12.221 -> 172.20.12.16 NFS R NULL3
172.20.12.16 -> 172.20.12.220 NFS C FSINFO3 FH=D502
172.20.12.221 -> 172.20.12.16 NFS R FSINFO3 OK
172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK
172.20.12.16 -> 172.20.12.220 NFS C FSSTAT3 FH=D502
172.20.12.221 -> 172.20.12.16 NFS R FSSTAT3 OK


Attempt to re-mount using TCP again, for fun:

# umount /export/www
# mount /export/www
NFS server x4500-01-vip not responding still trying

172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100005 (MOUNT)
vers=3 proto=UDP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=39967
172.20.12.16 -> 172.20.12.220 MOUNT3 C Null
172.20.12.221 -> 172.20.12.16 MOUNT3 R Null
172.20.12.16 -> 172.20.12.220 MOUNT3 C Mount /export/www
172.20.12.221 -> 172.20.12.16 MOUNT3 R Mount OK FH=D502 Auth=unix
172.20.12.16 -> 172.20.12.220 PORTMAP C GETPORT prog=100003 (NFS) vers=3
proto=TCP
172.20.12.221 -> 172.20.12.16 PORTMAP R GETPORT port=2049
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Syn Seq=2389376336
Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Syn Ack=2389376337
Seq=997480070 Len=0 Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480071
Seq=2389376337 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 NFS C NULL3
172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376457
Seq=997480071 Len=0 Win=49520
172.20.12.220 -> 172.20.12.16 NFS R NULL3
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480099
Seq=2389376457 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Fin Ack=997480099
Seq=2389376457 Len=0 Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Ack=2389376458
Seq=997480099 Len=0 Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=54093 S=2049 Fin Ack=2389376458
Seq=997480099 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=54093 Ack=997480100
Seq=2389376458 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Rst Ack=0 Seq=1240043383
Len=0 Win=49640
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Rst Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
Seq=99287825 Len=0 Win=49640
172.20.12.16 -> 172.20.12.220 TCP D=2049 S=664 Syn Seq=1249838073 Len=0
Win=49640 Options=<mss 1460,nop,wscale 0,nop,nop,sackOK>
172.20.12.220 -> 172.20.12.16 TCP D=664 S=2049 Ack=1240043383
Seq=99287825 Len=0 Win=49640


So, TCP stays hung until reboot. If I reboot the NFS client, it will
mount over TCP just fine again. When both UDP and TCP have hung, there
is nothing I can do to make it mount. We never reboot the x4500s.

Of the 40-odd NFS clients, we have to reboot about 6 every day, which is
getting tedious. Worse than that, we do not always notice immediately
when one is stuck.
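
To notice a stuck mount sooner, something along these lines could be
run periodically on each client. This is only a sketch, assuming a
Python 3 interpreter is available on the client; probe_mount and
unhealthy_mounts are hypothetical names. Each mount point is probed in
a child process so that a hung mount blocks the child and not the
checker itself.

```python
import subprocess
import sys

# Child command: a single statvfs() call, which blocks on a hung NFS mount.
_PROBE = "import os, sys; os.statvfs(sys.argv[1])"


def probe_mount(path, timeout=10.0):
    """Return True if statvfs(path) completes successfully within
    `timeout` seconds, False if it fails or does not return in time."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child blocked on a hard mount may not exit
        # until the server responds again, so we do not wait for it.
        proc.kill()
        return False


def unhealthy_mounts(paths, timeout=10.0):
    """Return the subset of `paths` that failed the probe."""
    return [p for p in paths if not probe_mount(p, timeout)]
```

For example, unhealthy_mounts(["/export/mail", "/export/preview"])
would return the mount points that did not answer in time, and the
result could be mailed or logged by whatever monitoring is already in
place. A path that does not exist is also reported, since the probe
cannot tell the two cases apart.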

We have put Solaris 10 10/08 on some NFS clients as well, but it is too
early to know if it fixes anything. We will most likely also try 10/08
on the x4500, but that is a much larger task.

Are there any NFS related patches we should explore?

Sorry for the length of this email. I wanted to include as much detail
as possible and show that I have tried most things in an attempt to
discover where the trouble lies.

Other Google results hinted at running out of secure ports, but netstat
shows no indication of that as far as I can tell. There are no entries
for the hung NFS client on the x4500. The NFS client has a relatively
short netstat -na listing, with the exception of 47 entries for
"stream-ord".


We would appreciate any feedback on this issue, thank you.


Lund


*** Random commands while mount is hung:

# showmount -e x4500-01-vip
export list for x4500-01-vip:
/export/mail    @172.20.12,@172.20.15
/export/www     @172.20.12,@172.20.15
/export/dovecot @172.20.12,@172.20.15


# rpcinfo -m x4500-01-vip
PORTMAP (version 2) statistics
NULL    SET     UNSET   GETPORT         DUMP    CALLIT
0       0/0     0/0     1503694/1503838 0       0/0

PMAP_GETPORT call statistics
prog            vers    netid     success       failure
nlockmgr        4       udp       4342          0
status          1       tcp       2             0
nlockmgr        2       udp       42            0
nlockmgr        4       tcp       1433764       0
nfs             3       udp       346           0
nfs             3       tcp       400           0
status          1       udp       49            0
mountd          1       udp       79            2
mountd          1       tcp       11            2
mountd          3       udp       654           113
rquotad         1       udp       64001         23
metad           2       tcp       3             0
smserverd       1       tcp       0             1
smserverd       1       udp       0             1
300598          1       udp       1             1
300598          1       tcp       0             1

RPCBIND (version 3) statistics
NULL    SET     UNSET   GETADDR DUMP    CALLIT  TIME    U2T     T2U
0       0/0     0/0     2/2     0       0/0     0       0       0

RPCB_GETADDR (version 3) call statistics
prog            vers    netid     success       failure
status          1       ticotsord 1             0
100133          1       ticotsord 1             0

RPCBIND (version 4) statistics
NULL    SET     UNSET   GETADDR DUMP    CALLIT  TIME    U2T     T2U
0       99/99   115/115 1/2     0       0/0     0       0       0
VERADDR INDRECT GETLIST GETSTAT
0       0       1       1

RPCB_GETADDR (version 4) call statistics
prog            vers    netid     success       failure
smserverd       1       ticlts    1             1


# rpcinfo -T tcp x4500-01-vip 100005 3
program 100005 version 3 ready and waiting



-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)

