Just a progress update post. We are still experiencing this problem, and since we now have 21 x4540s, we see it fairly regularly.

The NOC staff remount the cgi and vmx cluster servers every morning, as that makes the problem occur less frequently.

We now monitor /var/adm/messages for the string:

Jul 26 21:05:34 cgi12.unix Suspected server reboot.

This message occurs with increasing frequency: at first once a week or so, eventually reaching twice a day. Each time, if left alone, it recovers more slowly. It recovers a little faster if you restart the processes and remount the file system.
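For anyone who wants to watch the same trend, here is a rough sketch of how the message could be counted per day. It is not the script we actually run; the function names are made up for illustration, and it assumes each log line starts with the usual "Mon DD HH:MM:SS" syslog timestamp.

```python
from collections import Counter

# The string quoted above; everything else in this sketch is hypothetical.
MARKER = "Suspected server reboot"


def count_by_day(lines, marker=MARKER):
    """Return a Counter mapping 'Mon DD' to the number of lines containing marker."""
    counts = Counter()
    for line in lines:
        if marker in line:
            fields = line.split()
            # Assumes the first two fields are month and day, as in syslog.
            if len(fields) >= 2:
                counts[f"{fields[0]} {fields[1]}"] += 1
    return counts


def scan_file(path="/var/adm/messages"):
    """Count marker lines per day in the given log file."""
    # errors="replace" so a stray undecodable byte does not abort the scan.
    with open(path, errors="replace") as fh:
        return count_by_day(fh)
```

Calling scan_file() on a client and printing the result gives one count per day, which makes it easy to see the once-a-week to twice-a-day progression. Syslog lines carry no year, so counts from the same date in different years would be merged.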

During an outbreak, nfsd appears unresponsive but not overloaded. In the above case on x4500-14, the load average was 0.60, with ~350 threads in use. 'df' commands on the cgi servers took 4-20 seconds to get statfs on this mount.
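The 'df' delay can also be measured directly. The following is only a sketch, with hypothetical helper names, that times a single statvfs() call on a mount point; the 4-second default threshold is simply the low end of the range we observed, not a tuned value.

```python
import os
import time


def time_statvfs(path):
    """Return the wall-clock seconds taken by one statvfs() call on path."""
    start = time.monotonic()
    os.statvfs(path)
    return time.monotonic() - start


def is_slow(path, threshold=4.0):
    """True if statvfs() on path took at least threshold seconds."""
    # threshold is an assumption based on the 4-20 second range seen with 'df'.
    return time_statvfs(path) >= threshold
```

Note that on a hard-mounted NFS file system the call simply blocks until the server answers, so this reports the delay only after the fact and cannot enforce a timeout by itself.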

Currently, when we see the above message for a particular x4540 server, we schedule maintenance and restart nfsd. This clears the problem on that server for a few weeks.

This happens most on cgi, vmx and pop clusters (in decreasing order).

My wild guess, and I have nothing but hunches to back this up, is that nfsd has some sort of bug where it completely loses, or invalidates, all stored clientid information, perhaps open files or locks. This forces all NFS clients to renegotiate all open files and locks, which results in a network frenzy that takes several minutes to recover from. nfsd then appears to have hung.

The frequency of triggering the nfsd bug seems to increase with age.

At the request of our vendor, we have run GUDS, livecore, gcore, dtrace, and many stat commands.

We still run Sol10u8 (Generic_141445-09) everywhere.

--
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
_______________________________________________
nfs-discuss mailing list
[email protected]
