Another client:

cgi01.unix 5.10 Generic_127128-11

logs these lines when the problem occurs:

Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: 
x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_OPEN got error NFS4ERR_DELAY 
causing recovery action NR_DELAY.

Nov 27 14:01:55 cgi01.unix last message repeated 98 times

Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: 
x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_GETATTR got error 
NFS4ERR_STALE 
causing recovery action NR_STALE.

Nov 27 14:01:55 cgi01.unix nfs: [ID 286389 kern.info] NOTICE: [NFS4][Server: 
x4500-07.unix][Mntpt: /export/sd03/www]File ./$customer-path/$file.lok 
(rnode_pt: e06c73d0) was closed due to NFS recovery error on server 
x4500-07.unix(failed to recover from NFS4ERR_STALE NFS4ERR_STALE)

Nov 27 14:01:55 cgi01.unix nfs: [ID 941083 kern.info] NOTICE: NFS4 FACT SHEET:

Nov 27 14:01:55 cgi01.unix  Action: NR_STALE

Nov 27 14:01:55 cgi01.unix  NFS4 error: NFS4ERR_STALE

Nov 27 14:03:42 cgi01.unix nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: 
x4500-14.unix][Mntpt: /export/sd04/www]NFS Recovery done for mount 
/export/sd04/www (mi 0xd9779800) on server x4500-14.unix, rnode_pt1 
./$customer-file (0xdc7496d8), rnode_pt2 ./$customer-file (0xe00af188)

Nov 27 14:03:42 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: 
x4500-14.unix][Mntpt: /export/sd04/www]NFS op OP_OPEN got error 
NFS4ERR_STALE_CLIENTID causing recovery action NR_CLIENTID.  Client also 
suspects that the server rebooted, or experienced a network partition.


This only happens against the NFS servers (X4500s) running Solaris 10 u8, not 
those running snv_117. No machine has been rebooted and no process has been 
restarted; they just seem to periodically lose all state. Currently it happens 
about once a day.
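Since the failures recur on a rough schedule, it may help to histogram the 
NFS4ERR messages by hour and line them up against cron jobs, scrubs, or backups 
on the server. A minimal sketch (the grep pattern matches the syslog lines 
above; /var/adm/messages is the default Solaris log location, adjust if your 
syslog.conf differs):

```shell
# Tally NFS4ERR_* syslog lines per hour, so the daily state loss can be
# compared against scheduled activity on the server.
nfs4_err_histogram() {
    # $1: path to a syslog-format file, e.g. /var/adm/messages
    grep 'NFS4ERR_' "$1" \
        | awk '{ print $1, $2, substr($3, 1, 2) ":00" }' \
        | sort | uniq -c | sort -rn
}

# Example: nfs4_err_histogram /var/adm/messages
```

Running this on each client and on the server should show whether all 20 
clients really do fail in the same hour.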




Jorgen Lundman wrote:
>
> Hello list,
>
> NFS server (x4540):
> * SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc
>
> and about 20 clients hanging off it, today I was logged in on cgi01 to
> do my tests:
>
> * SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc
>
>
> This system is generally running well, but every now-and-then we get
> periods where it slows to a crawl, even listing "df -h" output will
> pause before the x4500-14 output. It seems to be connected to just
> x4500-14, which is currently the only Solaris 10 u8 server in production
> (the others are running 5.11 snv_117 and appear not to be involved).
>
> I can umount and mount the file-system and everything goes back to normal.
>
> The load on the x4500-14 itself is low:
>
> load averages: 0.35, 0.33, 0.36 15:05:34
> 52 processes: 50 sleeping, 1 stopped, 1 on cpu
> CPU states: % idle, % user, % kernel, % iowait, % swap
> Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free
>
> PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
> 832 daemon 96 60 -20 2968K 2160K sleep 92.8H 1.96% nfsd
>
> The nfsv4 domains match, the groups and passwd files are identical. All
> mounted with:
>
> "remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a"
>
>
> On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR):
>
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen ) PUTFH FH=8873 OPEN OT=NC
> SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH
> NFS4_OK OPEN NFS4ERR_NO_GRACE
>
>
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637
> Seq=2641046014 Len=0 Win=49640
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> [lots of these]
>
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID
> NFS4_OK CL=5684ae805ee CFV=000007B100000568
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM
> CL=5684ae805ee CFV=000007B100000568 PUTROOTFH GETATTR 400 0
> 172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072
> Seq=1242317416 Len=0 Win=49640
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK
> SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK
> 172.20.12.246 -> 172.20.12.26 NFS C CB_NULL
>
>
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=956C SAVEFH
> LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY
> GETATTR 1...
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH
> NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR
> NFS4_OK RESTOREFH NFS4_...
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=9454 SAVEFH
> LOOKUP ja GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 10011a
> b0a23...
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH
> NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=8CAD GETATTR
> NFS4_OK RESTOREFH NFS4_...
> [lots of these too]
>
> If I leave it alone for about 5 minutes, it does seem to sort itself
> out. I suspect that all 20 clients hit this at exactly the same time,
> so it takes nearly 15 minutes for everything to recover (on its own).
>
> This would indicate that it is the x4500-14 server that is losing state.
> But the NFS processes (nfsd, statd, lockd and nfsmapid) all have the
> same start date (days in the past), and nfsd/lockd are running 96/3
> threads respectively.
>
> Which of the snoop outputs MIGHT be related to the trouble? I'm unsure
> which ERRORS are part of standard NFSv4 operation, and which are not.
>
> Thanks for any replies!
>
> Lund
>
>
>

-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
