It's pretty clear that there are a number of issues, in both the 1.4.x and 
1.6.x series, that cause extremely poor performance with "busy" fileservers or 
clients. Some of these have been there for a long time; others come from 
changes made over the life of the 1.4.x and 1.6.x series. In particular, what 
seems to happen is that servers (and, in some cases, whole cells) will quite 
happily scale up to a particular load. However, once that load is exceeded, 
instead of degrading gracefully, things will just jam up completely.

The first possible cause is journalling filesystems. Many of these flush their 
journals to disk at regular intervals, blocking or reducing access to the 
filesystem during the journal flush. This block can be enough to cause the 
fileserver to start queuing incoming connections, and in a site that is finely 
balanced, may be enough to cause performance to stall. This was made 
considerably worse by the fileserver performing a sync() operation every 10 
seconds. This is fixed in 1.6.0 - a 1.4.x release containing the fix has yet to 
appear.
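
To make the "finely balanced" point concrete, here is a rough back-of-envelope 
sketch. It assumes a deliberately simple model in which the fileserver is fully 
blocked for part of every flush interval; the function names and all the 
numbers except the 10 second interval are hypothetical, not measurements.

```python
def effective_capacity(peak_rate, stall_seconds, interval_seconds):
    """Average requests/second the server can sustain if it is fully
    blocked for stall_seconds out of every interval_seconds (assumed model)."""
    if not 0 <= stall_seconds <= interval_seconds:
        raise ValueError("stall must fit inside the interval")
    return peak_rate * (interval_seconds - stall_seconds) / interval_seconds


def backlog_after(offered_rate, peak_rate, stall_seconds, interval_seconds, seconds):
    """Requests left queued after `seconds` when the offered load exceeds
    the effective capacity; zero if the server can keep up."""
    capacity = effective_capacity(peak_rate, stall_seconds, interval_seconds)
    return max(0.0, (offered_rate - capacity) * seconds)


if __name__ == "__main__":
    # Hypothetical figures: 1000 req/s peak, 1 s stall in every 10 s interval.
    print(effective_capacity(1000, 1, 10))
    # A site offering 950 req/s looks fine on paper, but never catches up.
    print(backlog_after(950, 1000, 1, 10, 60))
```

Under these assumptions, a server that copes with its load on paper falls 
steadily behind once the stalls are accounted for, which is the sort of tipping 
point described above.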

The next cause is deadlocks between the client and the fileserver. The Linux 
dynamic vcaches code, which was added in 1.4.10, is known to interact badly 
with fileserver callback breaks, especially in situations where the fileserver 
is under heavy load. There is a fix in 1.6.0, but we have yet to ship a 1.4.x 
release which contains it. You can also work around this particular problem by 
disabling dynamic vcaches in your clients.

The idle dead code (present since 1.4.8, made considerably worse in 1.6.0) then 
exacerbates any performance problems that you may be seeing. If the client 
hasn't received a response from the server in a (small) number of seconds, it 
gives up on the request, and tries a different server. However, if the server 
hasn't responded because it is overloaded, or because it is waiting for a 
callback break, then the client's request will still be queued on the server - 
either taking up a valuable thread, or amongst the "calls waiting for a 
thread". The server will then (eventually) process the packets which it has 
received, and attempt to perform the operation requested by the client, which 
has long since gone away. In some experimental situations, idle dead can 
actually lead to exponential load increases on the fileserver as clients pound 
on a particular busy server.
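
The feedback loop can be illustrated with a toy simulation. This is only a 
sketch under stated assumptions, not the real Rx or fileserver logic: time 
moves in ticks, the server is a single FIFO queue, and a client that has 
waited idle_dead_ticks resends its request to the same server (as happens when 
no other server can satisfy it). The function and parameter names are 
hypothetical.

```python
from collections import deque


def simulate(arrivals_per_tick, service_per_tick, idle_dead_ticks, ticks):
    """Toy model of idle dead retries against one FIFO server.

    Returns (useful, wasted, backlog): requests answered while the client
    was still waiting, requests processed after the client gave up, and
    requests still queued at the end.
    """
    queue = deque()  # each entry is the tick at which a request was queued
    useful = wasted = 0
    for now in range(ticks):
        # Fresh requests from clients.
        for _ in range(arrivals_per_tick):
            queue.append(now)
        # Requests that have just hit the idle dead limit: the client gives
        # up and resends, but the stale copy stays queued on the server.
        retries = sum(1 for queued_at in queue if now - queued_at == idle_dead_ticks)
        for _ in range(retries):
            queue.append(now)
        # The server works through the queue, oldest request first.
        for _ in range(min(service_per_tick, len(queue))):
            queued_at = queue.popleft()
            if now - queued_at >= idle_dead_ticks:
                wasted += 1  # the client has long since gone away
            else:
                useful += 1
    return useful, wasted, len(queue)


if __name__ == "__main__":
    print("below capacity:", simulate(2, 3, 5, 100))
    print("above capacity:", simulate(4, 3, 5, 100))
```

Below capacity, nothing is ever wasted. Slightly above it, the backlog grows 
faster than the excess load alone would explain, and the server ends up 
spending its time on requests that nobody is waiting for. The model is too 
crude to reproduce the exponential growth seen in experiments, but it shows 
the direction of the effect.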

Hope that's of some use...

Simon.

_______________________________________________
OpenAFS-devel mailing list
OpenAFS-devel@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-devel
