On Jan 28, 2011, at 1:58 PM, Jeff Blaine wrote:

> On 1/28/2011 1:52 PM, Derrick Brashear wrote:
>> did shutdown perchance take 30min?
>
> Yes. I found this in BosLog.old just now:
>
> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown within
> 1800 seconds
> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
We have seen similar issues. It occurs when a given vice partition has lots of clients with registered callbacks, but those clients are no longer reachable. Not all of the clients have responded when the 1800-second timer expires, and the fileserver goes down uncleanly.

We have about 235,000 volumes spread across 40 vice partitions. Our 'fix' is a combination of lengthening that timeout to 3600 seconds and keeping our vice partitions no larger than 2TB. Active volumes are spread roughly equally across those 40 partitions. But that's just a stopgap; the longer a server stays up, the more likely it is to accumulate dead callbacks.

Two things I suspect but don't know for certain:

Dynamic attach may help this a bit, simply because fewer volumes will be attached and therefore fewer will need to be detached. I plan on trying this out soon. :-)

I haven't read the code, but from watching the logfiles during a shutdown it appears that the fileserver breaks callbacks in a single-threaded manner per partition. This could probably be parallelized; a simple thought experiment says X parallel callback breaks would reduce run time T to roughly T/X.
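The T/X thought experiment above can be sketched quickly. This is purely illustrative Python, not OpenAFS code: the partition names and the break_callbacks stub are hypothetical stand-ins, with a sleep simulating the per-partition callback-break pass.

```python
# Hypothetical sketch (not OpenAFS code): break callbacks for each vice
# partition concurrently instead of one partition at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def break_callbacks(partition):
    # Stand-in for the per-partition callback-break pass; in the real
    # fileserver each break is an RPC that can hang on a dead client.
    time.sleep(0.2)
    return partition

partitions = [f"/vicep{c}" for c in "abcd"]

# X workers, one per partition: total wall time approaches T/X when the
# partitions take roughly similar time to finish.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    done = list(pool.map(break_callbacks, partitions))
elapsed = time.monotonic() - start

print(done)     # map() preserves input order
print(elapsed)  # roughly 0.2s, versus ~0.8s for a serial loop
```

In practice the speedup would be bounded by the slowest partition (the one with the most dead clients), so the real gain is closer to T divided by the number of partitions only when dead callbacks are evenly distributed.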
