It's pretty clear that there are a number of issues, in both the 1.4.x and 1.6.x series, that cause extremely poor performance with "busy" fileservers or clients. Some of these have been there for a long time; others come from changes made over the life of the 1.4.x and 1.6.x series. In particular, what seems to happen is that servers (and, in some cases, whole cells) will quite happily scale up to a particular load. However, once that load is exceeded, instead of degrading gracefully, things will just jam up completely.
The first possible cause is journalling filesystems. Many of these flush their journals to disk at regular intervals, blocking or reducing access to the filesystem during the journal flush. This block can be enough to cause the fileserver to start queuing incoming connections and, in a site that is finely balanced, may be enough to cause performance to stall. This was made considerably worse by the fileserver performing a sync() operation every 10 seconds. This is fixed in 1.6.0; a 1.4.x release containing the fix has yet to appear.

The next cause is deadlocks between the client and the fileserver. The Linux dynamic vcaches code, which was added in 1.4.10, is known to interact badly with fileserver callback breaks, especially in situations where the fileserver is under heavy load. There is a fix in 1.6.0, but we have yet to ship a 1.4.x release which contains it. You can also work around this particular problem by disabling dynamic vcaches in your clients.

The idle dead code (present since 1.4.8, made considerably worse in 1.6.0) then exacerbates any performance problems that you may be seeing. If the client hasn't received a response from the server in a (small) number of seconds, it gives up on the request and tries a different server. However, if the server hasn't responded because it is overloaded, or because it is waiting for a callback break, then the client's request will still be queued on the server, either taking up a valuable thread or sitting amongst the "calls waiting for a thread". The server will then (eventually) process the packets which it has received and attempt to perform the operation requested by the client, which has long since gone away. In some experimental situations, idle dead can actually lead to exponential load increases on the fileserver as clients pound on a particular busy server.

Hope that's of some use...

Simon.
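
As a rough illustration of the idle dead feedback described above, here is a toy model in Python. It is not OpenAFS code: the function name, the parameters and the one-queue, fixed-capacity server are all assumptions made purely for illustration. A client whose request has waited timeout ticks gives up and sends a fresh request, while the abandoned one stays in the queue and still costs the server work when it is finally dequeued.

```python
from collections import deque

def simulate(arrivals, capacity, timeout, ticks, retry):
    """Toy model of a server with a FIFO request queue (hypothetical, not OpenAFS).

    arrivals: new client requests per tick
    capacity: requests the server can process per tick
    timeout:  ticks after which a client abandons a queued request
    retry:    if True, an abandoning client immediately queues a new request
    Returns (useful, wasted, final_queue_length).
    """
    queue = deque()  # each entry is the tick at which the request was queued
    useful = wasted = 0
    for now in range(ticks):
        # Requests that have just reached the timeout are abandoned by their
        # clients, but they remain in the server's queue.
        expired = sum(1 for queued_at in queue if now - queued_at == timeout)
        # Fresh arrivals, plus one retry for every request that just expired.
        new_requests = arrivals + (expired if retry else 0)
        queue.extend([now] * new_requests)
        # The server works through the oldest requests first.
        for _ in range(min(capacity, len(queue))):
            queued_at = queue.popleft()
            if now - queued_at >= timeout:
                wasted += 1   # the client gave up long ago
            else:
                useful += 1
    return useful, wasted, len(queue)

if __name__ == "__main__":
    for retry in (False, True):
        print("retry =", retry, simulate(12, 10, 3, 50, retry))
```

Below capacity the retry flag makes no difference, since nothing ever waits long enough to time out. Once arrivals exceed capacity, the retrying case ends with a longer queue than the non-retrying one, and the server starts spending its threads on requests nobody is waiting for, which is the "jam up rather than degrade" behaviour in miniature.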
_______________________________________________
OpenAFS-devel mailing list
OpenAFS-devel@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-devel