Another client: cgi01.unix 5.10 Generic_127128-11
will issue these lines when it happens: Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_OPEN got error NFS4ERR_DELAY causing recovery action NR_DELAY. Nov 27 14:01:55 cgi01.unix last message repeated 98 times Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_GETATTR got error NFS4ERR_STALE causing recovery action NR_STALE. Nov 27 14:01:55 cgi01.unix nfs: [ID 286389 kern.info] NOTICE: [NFS4][Server: x4500-07.unix][Mntpt: /export/sd03/www]File ./$customer-path/$file.lok (rnode_pt: e06c73d0) was closed due to NFS recovery error on server x4500-07.unix(failed to recover from NFS4ERR_STALE NFS4ERR_STALE) Nov 27 14:01:55 cgi01.unix nfs: [ID 941083 kern.info] NOTICE: NFS4 FACT SHEET: Nov 27 14:01:55 cgi01.unix Action: NR_STALE Nov 27 14:01:55 cgi01.unix NFS4 error: NFS4ERR_STALE Nov 27 14:03:42 cgi01.unix nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: x4500-14.unix][Mntpt: /export/sd04/www]NFS Recovery done for mount /export/sd04/www (mi 0xd9779800) on server x4500-14.unix, rnode_pt1 ./$customer-file (0xdc7496d8), rnode_pt2 ./$customer-file (0xe00af188) Nov 27 14:03:42 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: x4500-14.unix][Mntpt: /export/sd04/www]NFS op OP_OPEN got error NFS4ERR_STALE_CLIENTID causing recovery action NR_CLIENTID. Client also suspects that the server rebooted, or experienced a network partition. This only happens against NFS-Server/X4500 running Solaris 10 u8, not those running svn_117. No machine has been rebooted, no process has been restarted. They just periodically seem to lose all state. Currently once/day. Jorgen Lundman wrote: > > Hello list, > > NFS server (x4540): > * SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc > > and about 20 clients hanging off it, today I was logged in on cgi01 to > do my tests: > > * SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc > > > This system is generally running well, but every now-and-then we get > periods where it slows to a crawl, even listing "df -h" output will > pause before the x4500-14 output. It seems to be connected to just > x4500-14, which is currently the only Solaris 10 u8 server is production > (the others are running 5.11 snv_117 and appear not to be involved). > > I can umount and mount the file-system and everything goes back to normal. > > The load on the x4500-14 itself is low: > > load averages: 0.35, 0.33, 0.36 15:05:34 > 52 processes: 50 sleeping, 1 stopped, 1 on cpu > CPU states: % idle, % user, % kernel, % iowait, % swap > Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free > > PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND > 832 daemon 96 60 -20 2968K 2160K sleep 92.8H 1.96% nfsd > > The nfsv4 domains match, the groups and passwd files are identical. All > mounted with: > > "remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a" > > > On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR): > > 172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen ) PUTFH FH=8873 OPEN OT=NC > SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a > 172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH > NFS4_OK OPEN NFS4ERR_NO_GRACE > > > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637 > Seq=2641046014 Len=0 Win=49640 > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > [lots of these] > > 172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID > NFS4_OK CL=5684ae805ee CFV=000007B100000568 > 172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM > CL=5684ae805ee CFV=000007B100000568 PUTROOTFH GETATTR 400 0 > 172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072 > Seq=1242317416 Len=0 Win=49640 > 172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK > SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK > 172.20.12.246 -> 172.20.12.26 NFS C CB_NULL > > > 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=956C SAVEFH > LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY > GETATTR 1... > 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH > NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR > NFS4_OK RESTOREFH NFS4_... > 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=9454 SAVEFH > LOOKUP ja GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 10011a > b0a23... > 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH > NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=8CAD GETATTR > NFS4_OK RESTOREFH NFS4_... > [lots of these too] > > If I leave it alone for about 5 minutes, it does seem to sort itself > out. I suspect it is more that all 20 clients get this happening at > exactly the same time, so it takes near 15 minutes for it to recover (on > its own). > > This would indicate that it is the x4500-14 server that is losing state. > But the processes for NFS (nfsd, stat, lockd and mapid) have all the > same date (days in the past) and nfsd/lockd are using 96/3 threads > respectively. > > Which one of the snoop output MIGHT be related to the trouble? I'm > unsure which ERRORS are part of standard NFS operations, and which are not. > > Thanks for any replies! > > Lund > > > -- Jorgen Lundman | <lundman at lundman.net> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell) Japan | +81 (0)3 -3375-1767 (home)