On Tue, 01 Feb 2011 12:04:08 -0800 Patricia O'Reilly <[email protected]> wrote:
> From what you have described it sounds to me like you need the patch
> that Andrew referenced earlier that allows you to configure an
> -offline-timeout and -offline-shutdown-timeout option on your
> fileservers. We have had similar problems at our site and will be
> releasing that patch into production shortly.

Maybe, maybe not. I think the most common cause of this is just having
more volumes than can be shut down in 30 minutes. Determining this is
easy; if it happens every single time you shut down the fileserver,
that's probably it. (But obviously that's not fun to do.)

But it could also be the 1.4.11 host package bugs; I don't know, and I
just noted that cause to illustrate that there are several possible
reasons.

> Jeff Blaine wrote:
> >
> > Thanks for the replies.
> >
> > I can't at all fathom that our issue is one of existing
> > client connections and callback break completion (timing out).

I'd only say that if you have pretty good control over all of your
clients. It's possible to see some really bizarre behavior (from the
fileserver's point of view) from old clients or clients on
oddly-behaving networks or NATs.

> >> Also, in this specific case, it may not be just that shutting down
> >> volumes took too long. 1.4.11 has known problems that can cause this
> >> (e.g. the host list gets a loop in it, and something spins forever
> >> trying to traverse the whole list).
> >
> > That's this, I think?:
> >
> > - Fixes to avoid issues cleaning up deleted hosts in
> >   the fileserver (126454)

There were a few issues; fixes for all of the ones known to cause
problems are included in 1.4.12. I don't have references for all of them
off the top of my head, but I can get them for you if you want.

> > Let's assume this issue is what caused our problem. I'm sort of at
> > a loss as to how to approach OpenAFS versions.
> > On one hand, expectations of more effort to make it clear in the
> > release notes what items could cause something like unclean server
> > shutdowns (kind of a big deal, IMO) are not really justifiable.

This wasn't an issue causing fileserver shutdowns to hang and get
killed; it was a general fileserver stability issue. That hang (or
crash, or however it manifested; I've seen both) could happen at any
time.

And doing something like that actually isn't that difficult for at least
most of the issues I am involved with. I already generally know which
versions are affected for the bigger issues, so just writing that down
would not be that hard. (But going back through all of the changes
between 1.4.Z and 1.4 head would be a lot of work at this point.) But
that's not true for all changes, and I think it may be prohibitively
difficult if we had to include information like that with every single
change to the stable branch.

I'm not sure how useful it is, though. In the specific case of the host
list issues, the only meaningful thing I can say is that "sometimes the
fileserver crashes". It's not really possible for you to know how
susceptible you are to it (unless you get hit by it), because the
circumstances required to trigger the crash are rather complex, and they
involve access patterns of clients that you generally cannot control or
even detect.

> > It's open source, etc. On the other hand, it's not acceptable to
> > blindly upgrade to the latest stable release every time it comes
> > out. I understand that the most obvious take-away is just, "You got
> > bit. Move on.", but if anything can improve on our end, I'd like to
> > do that.

Perhaps not right when a release comes out, but it can be a good idea to
move towards new releases, depending on how you do your risk/change
management.
Waiting a bit after each stable release for production machines makes
sense, to see if unknown issues crop up, but if there are significant
issues, you will hear about them if you are paying attention (probably in
the form of a new release, fixing the issue).

1.4.12 was released almost a year ago, and I don't think there are any
significant problems besides the issues that caused 1.4.14 to be
released. There are some smaller issues here and there that sometimes
get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would
cause me to recommend rolling back to pre-1.4.12 if you had upgraded a
machine to 1.4.12.

-- 
Andrew Deason
[email protected]

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
