Hello list, NFS server (x4540): * SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc
and about 20 clients hanging off it, today I was logged in on cgi01 to do my tests: * SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc This system is generally running well, but every now-and-then we get periods where it slows to a crawl, even listing "df -h" output will pause before the x4500-14 output. It seems to be connected to just x4500-14, which is currently the only Solaris 10 u8 server is production (the others are running 5.11 snv_117 and appear not to be involved). I can umount and mount the file-system and everything goes back to normal. The load on the x4500-14 itself is low: load averages: 0.35, 0.33, 0.36 15:05:34 52 processes: 50 sleeping, 1 stopped, 1 on cpu CPU states: % idle, % user, % kernel, % iowait, % swap Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND 832 daemon 96 60 -20 2968K 2160K sleep 92.8H 1.96% nfsd The nfsv4 domains match, the groups and passwd files are identical. All mounted with: "remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a" On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR): 172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen ) PUTFH FH=8873 OPEN OT=NC SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a 172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637 Seq=2641046014 Len=0 Win=49640 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID [lots of these] 172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID NFS4_OK CL=5684ae805ee CFV=000007B100000568 172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM CL=5684ae805ee CFV=000007B100000568 PUTROOTFH GETATTR 400 0 172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072 Seq=1242317416 Len=0 Win=49640 172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK 172.20.12.246 -> 172.20.12.26 NFS C CB_NULL 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=956C SAVEFH LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 1... 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR NFS4_OK RESTOREFH NFS4_... 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=9454 SAVEFH LOOKUP ja GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 10011a b0a23... 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=8CAD GETATTR NFS4_OK RESTOREFH NFS4_... [lots of these too] If I leave it alone for about 5 minutes, it does seem to sort itself out. I suspect it is more that all 20 clients get this happening at exactly the same time, so it takes near 15 minutes for it to recover (on its own). This would indicate that it is the x4500-14 server that is losing state. But the processes for NFS (nfsd, stat, lockd and mapid) have all the same date (days in the past) and nfsd/lockd are using 96/3 threads respectively. Which one of the snoop output MIGHT be related to the trouble? I'm unsure which ERRORS are part of standard NFS operations, and which are not. Thanks for any replies! Lund -- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)