We're currently seeing a complex of serious problems with our OpenAFS file servers that we believe may be related to this issue, to the vnode locking problem, and to a few other open issues. Here is some additional information to help in judging whether what we're seeing is connected.
We started seeing problems when moving from Debian lenny running 1.4.11 file servers to Debian squeeze running 1.4.14 file servers. This happened at roughly the same time as reducing the number of file servers by half (from about 13 to about 6), which of course concentrates any locking problems. The problems primarily affect the www.stanford.edu servers, which are currently running OpenAFS 1.4.12.1 built 2011-05-31. The symptoms (not all of which may be related) are:

1. The OpenAFS file servers are generally running at a much higher system load (as reported by uptime) than they were previously, although the higher load average is not consistent.

2. Running vos listvol -long -extended against a server causes the load average to shoot up to over 15 for as long as vos listvol is running. It's not clear whether this is correlated with client problems or with the other symptoms below; sometimes it seems to be, and sometimes the problems happen at different times.

3. The AFS file servers report periodic surges in client connections waiting for a thread. Previously, this was extremely rare and indicated a file server meltdown that was probably unrecoverable. Now, we're occasionally seeing spikes to 20, 50, even 80 clients waiting for a thread that persist for more than 30 seconds but then recover by themselves. The count has also frequently been going over 100, at which point monitoring we put in place during previous file server problems does a forced restart of the file server (a rough sketch of that check is included below), which of course takes quite a long time.

4. The www.stanford.edu servers periodically block on AFS access and their load shoots up to over 200. Normally they recover on their own after a few minutes.

5. When a file server has been forcibly restarted, sometimes the AFS clients on the www.stanford.edu servers never recover. They go into an endless cycle of kernel errors and have to be forcibly rebooted. (Unfortunately, I don't have one of those kernel errors handy, since they don't seem to be logged to syslog.)

6. We're seeing increasing numbers of kernel errors from other servers, particularly our filedrawers servers, reporting blocked processes attempting to access AFS (saying that the process was unable to make forward progress for more than X seconds).

7. When looking at an rxdebug -allconn snapshot of the file server during one of these periods of large numbers of blocked connections, the only hosts with more than four connections to the file server are the www.stanford.edu hosts, which frequently have up to 75 (the sort of per-host count shown below). Note that our web infrastructure generates very large numbers of separate PAGs, since we use complete AFS and Kerberos isolation via suexec for most user CGI processes and therefore spawn a new PAG and AFS token for each incoming client request.

8. Our file servers are reporting large numbers of the following error:

       Wed Dec 7 17:14:45 2011 CallPreamble: Couldn't get CPS. Too many lockers

   By large, I mean that one server has seen 156 of those errors so far today.

It's probably also worth noting that we continue to have the issue with AFS file servers that we've had for years: restarting a file server completely destroys AFS clients while the file server is attaching volumes. Between the point where the file server starts attaching volumes and the point where it finishes, any client that attempts to access those volumes ends up swamped with processes in disk wait and usually becomes essentially inaccessible. We therefore block all access to the file server using iptables when restarting it and keep access blocked until all volumes are attached, so that we can at least access data that's stored on other servers; a rough sketch of that procedure is below.
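For reference, the blocking we do around a restart is roughly the following (this is only a sketch: it assumes the standard fileserver port of 7000/udp, uses <server> and <dbserver> as placeholders, and leaves out the details of how we decide attachment has finished):

    # Drop all inbound fileserver traffic, but keep the database servers
    # reachable, since the fileserver's own Rx traffic also arrives on
    # port 7000.  The second -I lands above the DROP rule.
    iptables -I INPUT -p udp --dport 7000 -j DROP
    iptables -I INPUT -p udp --dport 7000 -s <dbserver> -j ACCEPT

    # Restart the fs instance.
    bos restart <server> fs -localauth

    # ... wait until FileLog shows that volume attachment has finished ...

    # Re-open client access once all volumes are attached.
    iptables -D INPUT -p udp --dport 7000 -s <dbserver> -j ACCEPT
    iptables -D INPUT -p udp --dport 7000 -j DROP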
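The forced-restart check mentioned in item 3 amounts to roughly this (again only a sketch, assuming rxdebug's "N calls waiting for a thread" output line and our threshold of 100; the real monitoring does more sanity checking than this):

    # Restart the fileserver if too many calls are waiting for a thread.
    waiters=$(rxdebug <server> 7000 -noconns \
              | awk '/calls waiting for a thread/ { print $1 }')
    if [ "${waiters:-0}" -gt 100 ]; then
        bos restart <server> fs -localauth
    fi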
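The per-host connection counts in item 7 come from something along these lines (assuming the "Connection from host ..." lines in rxdebug -allconnections output):

    # Count connections per client host from an rxdebug snapshot.
    rxdebug <server> 7000 -allconnections \
        | awk '/Connection from host/ { sub(/,$/, "", $4); print $4 }' \
        | sort | uniq -c | sort -rn | head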
-- 
Russ Allbery (r...@stanford.edu)             <http://www.eyrie.org/~eagle/>