salvage knowledge

Steve Simmons Mon, 07 Feb 2011 10:23:06 -0800

On Feb 1, 2011, at 3:58 PM, Andrew Deason wrote:

> On Tue, 01 Feb 2011 12:04:08 -0800
> Patricia O'Reilly <[email protected]> wrote:
> 
>> From what you have described it sounds to me like you need the patch
>> that Andrew referenced earlier that allows you to configure an
>> -offline-timeout and -offline-shutdown-timeout option on your
>> fileservers. We have had similar problems at our site and will be
>> releasing that patch into production shortly.
> 
> Maybe, maybe not. I think the most common cause of this is just having
> more volumes than can be shut down in 30 minutes. Determining this
> is easy; if it happens every single time you shut down the fileserver,
> that's probably it. (But obviously that's not fun to do.)
> 
> But it could also be the 1.4.11 host package bugs; I don't know, and I
> just noted that cause to illustrate that there are several possible
> reasons.
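
The "more volumes than fit in 30 minutes" case is easy to sanity-check with arithmetic. A minimal sketch follows, assuming volumes are shut down one at a time and that you have measured an average per-volume cost for your own site. The function names and the per-volume figures are hypothetical, not anything OpenAFS provides; only the 30-minute window comes from the discussion above.

```python
# Rough estimate of whether a fileserver can shut down all of its volumes
# before the shutdown window expires. The per-volume cost is an assumed,
# site-specific number that you would have to measure yourself.

SHUTDOWN_WINDOW_SECONDS = 30 * 60  # the 30-minute window discussed above


def estimated_shutdown_seconds(volume_count, seconds_per_volume):
    """Estimate total shutdown time, assuming volumes go one at a time."""
    if volume_count < 0 or seconds_per_volume < 0:
        raise ValueError("counts and times must be non-negative")
    return volume_count * seconds_per_volume


def fits_in_window(volume_count, seconds_per_volume,
                   window_seconds=SHUTDOWN_WINDOW_SECONDS):
    """Return True if the estimated shutdown time fits in the window."""
    total = estimated_shutdown_seconds(volume_count, seconds_per_volume)
    return total <= window_seconds
```

With an assumed 0.1 seconds per volume, `fits_in_window(10000, 0.1)` is true (about 1000 seconds) while `fits_in_window(20000, 0.1)` is false (about 2000 seconds), which matches the "happens every single time" symptom described above.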


As noted earlier, we saw this at least as far back as our use of 1.4.8. Prior to
that we'd been doing rolling restarts, i.e., moving all the volumes off a server
before restarting it. So it may have been present earlier, but we simply didn't
hit it.

> 
>> Jeff Blaine wrote:
>>> 
>>> Thanks for the replies.
>>> 
>>> I can't at all fathom that our issue is one of existing
>>> client connections and callback break completion (timing out).
> 
> I'd only say that if you have pretty good control over all of your
> clients. It's possible to see some really bizarre behavior (from the
> fileserver's point of view) from old clients or clients on
> oddly-behaving networks or NATs.

Seconded. A number of our more savvy users (or users who have savvy IT admins) 
run AFS at home, another large batch of folks are behind NATs/firewalls, and a 
third small group are alumni or ex-staff who use their AFS space from all over 
the world. As a proportion of overall users that's fairly small, but as a 
proportion of folks whose hosts time out during shutdown it's pretty large.

>>> Let's assume this issue is what caused our problem.  I'm sort of at
>>> a loss as to how to approach OpenAFS versions.  On one hand,
>>> expectations of more effort to make it clear in the release notes
>>> what items could cause something like unclean server shutdowns (kind
>>> of a big deal, IMO) are not really justifiable.
> 
> This wasn't an issue causing fileserver shutdowns to hang and get
> killed, it was a general fileserver stability issue; that hang (or
> crash, or however it manifested; I've seen both) could happen at any
> time.

There were two things which seemed to make the problem more likely - having the 
server up for a long time, and having lots of different hosts using volumes 
from that server. We did find a log entry that was usually a symptom of the 
problem about to occur, but once that entry appeared it was too late to fix it 
- either the server would crash or would get into an infinite loop in the next 
few minutes to hours. Attempting to restart the server once we'd seen it always 
tickled the bug; attaching to the process with gdb and forcing a core dump was how 
we finally diagnosed the bloody thing.
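
For anyone wanting to capture a core from a live process the same way, here is a minimal sketch that wraps gdb's batch mode. It relies on gdb's standard `--batch`, `-p`, and `-ex` options and its `generate-core-file` command; the helper names, the PID, and the output path are hypothetical, and this is not necessarily the exact procedure used at our site.

```python
import subprocess


def build_core_dump_command(pid, core_path):
    """Build a gdb command line that attaches to pid and writes a core file."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb", "--batch",        # run the given commands, then exit
        "-p", str(pid),          # attach to the running process
        "-ex", "generate-core-file {}".format(core_path),
    ]


def dump_core(pid, core_path):
    """Run gdb to dump a core; requires gdb and permission to attach."""
    return subprocess.run(build_core_dump_command(pid, core_path), check=True)
```

For example, `dump_core(1234, "/tmp/fileserver.core")` would attempt to write a core for process 1234, where both values are placeholders for your own fileserver PID and a path with enough free space.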

>>> It's open source, etc.  On the other hand, it's not acceptable to
>>> blindly upgrade to the latest stable release every time it comes
>>> out. I understand that the most obvious take-away is just, "You got
>>> bit. Move on.", but if anything can improve on our end, I'd like to
>>> do that.
> 
> Perhaps not right when it comes out, but it can be a good idea to move
> towards them, depending on how you do your risk/change management.
> Waiting a bit after each stable release for production machines makes
> sense, to see if unknown issues crop up, but if there are significant
> issues, you will hear about it if you are paying attention (probably in
> the form of a new release, fixing the issue).
> 
> 1.4.12 was released almost a year ago, and I don't think there are any
> significant problems besides the issues that caused 1.4.14 to be
> released. There are some smaller issues here and there that sometimes
> get hit, but there's no fix on the 1.4.x branch for 1.4.12 that would
> cause me to recommend rolling back to pre-1.4.12 if you had upgraded a
> machine to 1.4.12.

1.4.12 been bery bery good to me; there's no fix in .13/.14 that seems to 
affect us. Right now we're gearing up to build a test host for the latest 1.6 
release candidate. Barring some disastrous newfound issue with 1.4.12, 1.6 
makes more sense. As noted earlier in this discussion, dynamic attach looks 
like a fix for shutdown/restart timing issues.

Steve

Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge
