Hello list,

NFS server (x4540):
* SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc

About 20 clients hang off it. Today I was logged in on cgi01 to do my 
tests:

* SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc


This system generally runs well, but every now and then we get periods 
where it slows to a crawl; even "df -h" pauses before printing the 
x4500-14 line.  It seems to be connected to just x4500-14, which is currently 
the only Solaris 10 u8 server in production (the others are running 5.11 
snv_117 and appear not to be involved).

I can umount and mount the file-system and everything goes back to normal.

The load on the x4500-14 itself is low:

load averages:  0.35,  0.33,  0.36                                     15:05:34
52 processes:  50 sleeping, 1 stopped, 1 on cpu
CPU states:     % idle,     % user,     % kernel,     % iowait,     % swap
Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free

    PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
    832 daemon    96  60  -20 2968K 2160K sleep   92.8H  1.96% nfsd

The nfsv4 domains match, the groups and passwd files are identical. All mounted 
with:

"remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a"


On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR):

172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen      ) PUTFH FH=8873 OPEN OT=NC 
SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a
172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen      ) NFS4ERR_NO_GRACE PUTFH 
NFS4_OK OPEN NFS4ERR_NO_GRACE


172.20.12.246 -> 172.20.12.26 NFS R 4 (open        ) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637 Seq=2641046014 
Len=0 Win=49640
172.20.12.246 -> 172.20.12.26 NFS R 4 (open        ) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 NFS R 4 (open        ) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 NFS R 4 (open        ) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 NFS R 4 (open        ) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
[lots of these]

172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID 
NFS4_OK CL=5684ae805ee CFV=000007B100000568
172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM 
CL=5684ae805ee CFV=000007B100000568 PUTROOTFH GETATTR 400 0
172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072 Seq=1242317416 
Len=0 Win=49640
172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK 
SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK
172.20.12.246 -> 172.20.12.26 NFS C CB_NULL


172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup      ) PUTFH FH=956C SAVEFH 
LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 1...
172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup      ) NFS4ERR_SAME PUTFH NFS4_OK 
SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR NFS4_OK RESTOREFH 
NFS4_...
172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup      ) PUTFH FH=9454 SAVEFH 
LOOKUP ja GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 10011a b0a23...
172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup      ) NFS4ERR_SAME PUTFH NFS4_OK 
SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=8CAD GETATTR NFS4_OK RESTOREFH 
NFS4_...
[lots of these too]
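
To get a feel for how the errors above are distributed, the replies can be 
tallied by operation label and overall status. The following is only a rough 
sketch: it assumes the snoop output was saved as text in the summary format 
shown above, and the names join_records and tally_errors are made up for 
illustration. It makes no judgment about which statuses are harmless.

```python
import re
from collections import Counter

# A packet line starts with "src-ip -> dst-ip"; anything else is treated
# as a wrapped continuation of the previous packet line.
RECORD_START = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3} -> ")
# NFSv4 reply: operation label in parentheses, then the overall status.
REPLY = re.compile(r"NFS R 4 \((\w+)\s*\)\s+(NFS4\w+)")


def join_records(text):
    """Glue wrapped continuation lines back onto their packet line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def tally_errors(text):
    """Count (operation label, status) pairs for replies that failed."""
    counts = Counter()
    for record in join_records(text):
        match = REPLY.search(record)
        if match and match.group(2).startswith("NFS4ERR_"):
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Running tally_errors over a saved capture (for example the contents of a 
hypothetical snoop.txt) and printing counts.most_common() would show which 
operation/status pairs dominate during a slow period.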

If I leave it alone for about 5 minutes, it does seem to sort itself out. I 
suspect all 20 clients hit this at exactly the same time, so overall it takes 
nearly 15 minutes to recover (on its own).

This would indicate that it is the x4500-14 server that is losing state. But 
the NFS processes (nfsd, statd, lockd and nfsmapid) all have the same start 
date (days in the past), so none of them appears to have restarted, and 
nfsd/lockd are using 96/3 threads respectively.

Which of the snoop output lines might be related to the trouble? I'm unsure 
which errors are part of standard NFS operation, and which are not.

Thanks for any replies!

Lund



-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
