On Jan 31, 2011, at 12:36 PM, Andrew Deason wrote:

> On Mon, 31 Jan 2011 11:54:24 -0500
> Steve Simmons <[email protected]> wrote:
>
>>> Wed Jan 26 12:28:13 2011: upclientetc exited on signal 15
>>> Wed Jan 26 12:28:13 2011: upclientbin exited on signal 15
>>> Wed Jan 26 12:28:24 2011: fs:vol exited on signal 15
>>> Wed Jan 26 12:58:19 2011: bos shutdown: fileserver failed to shutdown
>>> within 1800 seconds
>>> Wed Jan 26 12:58:37 2011: fs:file exited on signal 9
>>
>> We have seen similar issues. It occurs when there is a given vice
>> partition where lots of clients have registered callbacks but those
>> clients are no longer accessible. Not all the clients have responded
>> when the 1800 second timer goes off, and the fileserver goes down
>> uncleanly.
>
> Also, in this specific case, it may not be just that shutting down
> volumes took too long. 1.4.11 has known problems that can cause this
> (e.g. the host list gets a loop in it, and something spins forever
> trying to traverse the whole list).
Yeah, we got seriously bit by that bug. But not just on shutdowns;
eventually the list would be so corrupt the processes would actually
crash. Dan Hyde spent a lot of time on that; it's why we're running
1.4.12 with a couple of patches currently. 'Fixing' that bug by regular
server restarts is an argument for those restarts.

But we were seeing the 1800 second timeout on shutdown at least back to
1.4.8. Based on our experience with earlier versions, the host list
corruption issue didn't surface until post-1.4.8. Or at least, not as
badly.

Steve
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
