On Wed, 07 Dec 2011 17:57:51 -0800 Russ Allbery <r...@stanford.edu> wrote:
> 5. When a file server has been forcibly restarted, sometimes the AFS
>    clients on the www.stanford.edu servers will never recover.  They
>    go into an endless cycle of kernel errors and have to be forcibly
>    rebooted in order to recover.  (Unfortunately, I don't have one of
>    those kernel errors handy, since it doesn't seem to be logged to
>    syslog.)

Even if you don't have the exact messages... any recollection as to what
they were? "blocked for more than X seconds", or something else?

> 8. We are getting large numbers of the following error reported by our
>    file servers:
>
>    Wed Dec 7 17:14:45 2011 CallPreamble: Couldn't get CPS. Too many lockers
>
>    By large, I mean that one server has seen 156 of those errors so far
>    today.

Yeah, I've been wondering lately whether this is just a consequence of the
larger number of new connections a particular server now sees, due to the
consolidation and the large number of PAGs/SetTokens involved. Every other
time I've seen this, it has been caused by either a bug or client
connection issues; but if enough completely new connections are coming in,
it seems possible for it to happen during normal connection negotiation.

It would be easy to make the host lock quota configurable (to adjust the
number, or to turn it off entirely), and you could see whether that makes
anything better.

> It's probably also worth noting that we continue to have the issue
> with AFS file servers, which we've had for years, that restarting a
> file server completely destroys AFS clients during the time period
> while the file server is attaching volumes.  Between the point where
> the file server starts attaching volumes and finishes attaching
> volumes, any client that attempts to access those volumes ends up
> being swamped in processes in disk wait and usually essentially
> becomes inaccessible.
> We therefore block all access to the file
> server using iptables when restarting it and keep access blocked until
> all volumes are attached, so that we can at least access data that's
> stored on other servers.

The current behavior is deliberate, and so it is easy to change. The
client currently waits for the VRESTARTING error to clear; it's a simple
matter of adding a client option to instead make it error out
immediately, if that's what you want. That makes server restarts very
visible to processes, though. (We had discussed server-side
solutions/workarounds for this before, but I don't really think that's
the right way to go.)

-- 
Andrew Deason
adea...@sinenomine.net

_______________________________________________
OpenAFS-devel mailing list
OpenAFS-devel@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-devel
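For reference, the iptables workaround Russ describes could be sketched
roughly as below. This is a sketch under assumptions, not Stanford's actual
script: 7000/udp is the standard AFS fileserver port, but the FileLog path
and the "started" message used as an attachment check vary by OpenAFS
version and packaging, and the whole thing must run as root on the server.

```shell
# Rough sketch (assumptions labeled): block AFS client traffic around a
# fileserver restart, and unblock only once volumes look attached.

# Block fileserver traffic (7000/udp), but leave loopback alone so local
# administrative commands on the server itself keep working.
iptables -I INPUT ! -i lo -p udp --dport 7000 -j DROP

# Restart the fileserver instance via the bosserver.
bos restart localhost fs -localauth

# Wait for volume attachment to finish.  The log path and message below
# are illustrative assumptions -- check your own FileLog for the line your
# fileserver version emits once it has finished attaching and is serving.
until grep -q 'File [Ss]erver .* started' /usr/afs/logs/FileLog; do
    sleep 5
done

# Restore client access.
iptables -D INPUT ! -i lo -p udp --dport 7000 -j DROP
```

Note the grep check is only safe because the fileserver rotates FileLog on
startup in the default configuration; if yours appends instead, an old
"started" line would end the wait early.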