Re: [OpenAFS] Re: Need volume state / fileserver / salvage knowledge

Steve Simmons Mon, 31 Jan 2011 08:55:07 -0800

On Jan 28, 2011, at 1:58 PM, Jeff Blaine wrote:

> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>> did shutdown perchance take 30min?
> 
> Yes.  I found this in BosLog.old just now:
> 
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within 
> 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9


We have seen similar issues. It occurs when lots of clients have registered 
callbacks on a given vice partition but those clients are no longer reachable. 
Not all of them have responded by the time the 1800-second timer goes off, so 
the fileserver goes down uncleanly.

We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' is a 
combination of lengthening that timeout to 3600 seconds and keeping our vice 
partitions no larger than 2 TB. Active volumes are spread roughly equally 
across those 40 partitions. But that's just a stopgap; the longer a server 
stays up, the more likely it is to accumulate dead callbacks.

Two things I suspect but don't know for certain:

Dynamic attach may help this a bit, simply because there will be fewer volumes 
attached and therefore fewer to detach. I plan on trying this out soon. :-)

I haven't read the code, but from watching the log files during a shutdown it 
appears that fs shutdown breaks callbacks in a single-threaded manner per 
partition. This could probably be parallelized; a simple thought experiment 
says X parallel callback breaks would reduce run time T to roughly T/X.
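
To put rough numbers on that thought experiment, here is a small Python sketch. 
It is not based on the fileserver code; the function name, the count of 400 
dead clients, and the 6-second per-client timeout are all made-up assumptions, 
and the model simply assumes each worker gives up on its share of unreachable 
clients one after another. Only the 1800-second limit comes from the BosLog 
excerpt above.

```python
import math


def estimated_break_time(dead_clients, per_client_timeout_s, workers=1):
    """Rough wall-clock seconds spent waiting on unreachable clients.

    Hypothetical model: every dead client costs one full timeout, and
    the clients are divided evenly among `workers` parallel breakers.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Each worker handles its share of clients sequentially.
    rounds = math.ceil(dead_clients / workers)
    return rounds * per_client_timeout_s


# Limit reported in the BosLog excerpt; the other numbers are invented.
BOS_SHUTDOWN_LIMIT_S = 1800

for workers in (1, 2, 4, 8):
    t = estimated_break_time(dead_clients=400, per_client_timeout_s=6,
                             workers=workers)
    verdict = "fits within" if t <= BOS_SHUTDOWN_LIMIT_S else "exceeds"
    print(f"{workers} worker(s): {t} s, {verdict} the "
          f"{BOS_SHUTDOWN_LIMIT_S} s limit")
```

With these invented figures a single worker needs 2400 seconds and blows 
through the limit, while two workers need 1200 seconds and fit. Real behavior 
will depend on how the fileserver actually schedules callback breaks, which I 
have not verified.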


_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
