Re: [nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

Jorgen Lundman Tue, 27 Jul 2010 20:51:57 -0700

Just a progress update post. We are still experiencing this problem, and since we now have 21 x4540 servers, we see it fairly regularly.

The NOC staff remount the cgi and vmx cluster servers every morning, as that makes it occur less frequently.


We now monitor /var/adm/messages for the string:

Jul 26 21:05:34 cgi12.unix Suspected server reboot.

This occurs with increasing frequency: at first once a week or so, until it reaches twice a day. Each time, if left alone, it recovers more slowly. It recovers a little faster if you restart processes and remount the file system.

During an outbreak, nfsd appears unresponsive but not overloaded. In the above case on x4500-14, the loadavg was 0.60, using ~350 threads. 'df' commands on the cgi servers would take 4-20 seconds to get statfs on this mount.

Currently, when we see the above message on a particular x4540 server, we schedule maintenance and restart nfsd. This clears the problem for a few weeks on that server.
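
For anyone wanting to track the same trend, here is a minimal sketch of a per-day count of that message. It is not the monitoring actually in use here; the function name count_by_day is made up for illustration, and it assumes each log line starts with the usual syslog "Mon DD" prefix as in the sample line above.

```python
from collections import Counter

# String quoted earlier in this post; everything else here is an assumption.
MARKER = "Suspected server reboot"


def count_by_day(lines, marker=MARKER):
    """Count lines containing `marker`, keyed by the (month, day) syslog prefix."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            # parts[0] is the month name, parts[1] the day of month
            counts[(parts[0], parts[1])] += 1
    return counts
```

It could be fed an open file handle for /var/adm/messages, for example `count_by_day(open("/var/adm/messages", errors="replace"))`, and the resulting counts compared from day to day.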
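
The slow 'df' can be approximated by timing a single statvfs call on the mount point. This is only a rough sketch: the helper name time_statvfs and the example path are hypothetical, and the call will block for as long as the NFS server takes to answer.

```python
import os
import time


def time_statvfs(path):
    """Return the seconds taken by one os.statvfs() call on `path`."""
    start = time.monotonic()
    os.statvfs(path)  # same file-system statistics that 'df' asks for
    return time.monotonic() - start


# Example (hypothetical mount point):
#   elapsed = time_statvfs("/mnt/x4500-14")
#   print(f"statvfs took {elapsed:.2f}s")
```

On a healthy mount this should return almost immediately, so values of several seconds would match what 'df' shows during an outbreak.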


This happens most on cgi, vmx and pop clusters (in decreasing order).

My wild guess, and I have nothing but hunches to back this up, is that nfsd has some sort of bug where it completely loses, or invalidates, all stored clientid information, perhaps open files or open locks. This forces all NFS clients to renegotiate all open files or locks, which results in a network frenzy that takes several minutes to recover from. nfsd then appears to have hung.


The frequency of triggering the nfsd bug seems to increase with age.

At the request of our vendor, we have run GUDS, livecore, gcore, dtrace, and many stat commands.


We still run Sol10u8 (Generic_141445-09) everywhere.

--
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
_______________________________________________
nfs-discuss mailing list
[email protected]
