On Jan 31, 2011, at 12:17 PM, Stephen Joyce wrote:

> On Mon, 31 Jan 2011, Steve Simmons wrote:
> 
>> We have seen similar issues. It occurs when a given vice partition has lots 
>> of clients with registered callbacks, but those clients are no longer 
>> reachable. Not all the clients have responded when the 1800-second timer 
>> goes off, and the fileserver goes down uncleanly.
>> 
>> We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' is 
>> a combination of lengthening that timeout to 3600 seconds and keeping our 
>> vice partitions no larger than 2 TB. Active volumes are spread roughly 
>> equally across those 40 partitions. But that's just a stopgap; the longer a 
>> server stays up, the more likely it is to accumulate dead callbacks.
> 
> Assuming this is true, isn't this a good argument to keep the weekly server 
> process restarts?

Weekly outages, even if only a few minutes each, are not acceptable here. 
Doing them less frequently starts to put us into the range of the timeout 
problems described above.

At the moment most of our AFS service processes have run happily for 237 days. 
That alone is a strong argument against needing weekly restarts. If there are 
memory leaks, etc., they largely aren't affecting us.

We mostly do restarts when we need to do software upgrades of one sort or 
another. They are typically done in a rolling fashion - upgrade the hot 
spare(s), vos move volumes to the hot spare(s), take down the vacated servers 
and upgrade, lather, rinse, repeat. At one point we went two years without a 
general AFS shutdown. We only moved away from that because of bugs that 
required us to do OS upgrades more frequently, or to upgrade the entire cell 
at once. Life seems generally better with respect to those issues, and campus' 
opinion of the service is better when there are no perceived outages.
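For concreteness, the evacuation step above can be sketched roughly as follows. This is a hypothetical sketch, not our actual script: the host and partition names are placeholders, and the exact `vos listvol` output parsing may need adjusting for your OpenAFS version.

```shell
#!/bin/sh
# Sketch: evacuate every read-write volume from one vice partition onto a
# hot spare, so the vacated fileserver can be taken down and upgraded
# without a client-visible outage. Hostnames/partitions are placeholders.
evacuate_partition() {
    src="$1"; spare="$2"; part="$3"
    # 'vos listvol <server> <partition>' prints one line per volume, roughly:
    #   <name> <id> <type> <size> K <status>
    # Move only the RW instances; clones and backups stay behind.
    vos listvol "$src" "$part" | awk '$3 == "RW" {print $1}' |
    while read -r vol; do
        vos move "$vol" "$src" "$part" "$spare" "$part" -localauth
    done
}

# Usage (run with server superuser credentials), then upgrade and restart:
#   evacuate_partition fs1.example.edu spare1.example.edu vicepa
#   bos shutdown fs1.example.edu -localauth -wait
#   ... install new binaries ...
#   bos restart fs1.example.edu -all -localauth
```

Repeat per server; since clients follow the VLDB entries after each `vos move`, the only user-visible effect is a brief callback break per moved volume.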

For the curious, we're running 1.4.12 with a couple of fixes we pulled forward 
from the 1.4.13 development stream. Barring new developments, the next one 
we'll give serious consideration to is 1.6.X.
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info