Just a thought - does the file server maintain any state information for authenticated users?
Could this be a case of NUM-HOSTS * NUM-USERS * NUM-SEPARATE-PAGS = LARGE NUMBER? I see all the code that increments/decrements CEs in viced/host.c. It would be nice to have a kdump equivalent for the file server process.

-- Nathan

> -----Original Message-----
> From: Neulinger, Nathan [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, October 02, 2001 1:04 PM
> To: '[EMAIL PROTECTED]'
> Subject: [OpenAFS-devel] ARGH... afs outage... and same stinking client
> bug that it doesn't ever see it...
>
> We had another afs outage here, and once again, the same client bug is
> causing the clients to never see the failure.
>
> This was the same fileserver bug that I reported a few weeks ago, where
> it is accumulating hundreds/millions of host entries:
>
>   37653  host_NumHostEntries
>      74  host_HostBlocks
>    1533  host_NonDeletedHosts
>    1508  host_HostsInSameNetOrSubnet
>       3  host_HostsInDiffSubnet
>      22  host_HostsInDiffNetwork
> 1172202  host_NumClients
>   16058  host_ClientBlocks
>
> That numClients is getting HUGE, and the file server is sucking up
> larger and larger amounts of memory.
>
> (Feel free to run xstat against afs[1-10].umr.edu. Some are db-only,
> though.) Interestingly, I also see this:
>
> -1090519756 rx_bogusHost
>
> (Output from xstat_fs_test afs4 1.)
>
> On a server that has been rebooted recently due to this problem, I saw
> 594 million on the bogusHost and 293 thousand on the numClients.
> Somewhere there is a bad leak. Has anyone else seen anything like this?
>
> We're contemplating re-enabling periodic (maybe monthly) server
> restarts at the moment, but would rather have a better fix.
>
> As far as the client bug - which appears to occur on several different
> platforms - basically, the clients just hang. They don't time out and
> see the file server go down, or anything. Now - the instant I kill that
> file server, reboot it, or firewall it, the clients ALL break loose
> immediately.
> The problem is basically that all of the afs clients are completely
> hung and won't respond to much of anything. This means that a single
> afs server going down in this way negates all the benefit of replicated
> volumes.
>
> I have never been able to reproduce this symptom by suspending,
> firewalling, etc. a file server; the clients all see it immediately.
>
> If someone can give me any ideas on how I might reproduce this failure
> symptom (i.e. dropped packets, whatever), I have a test cell that I
> will use to see about diagnosing the client and server, but at the
> moment I do not have any way of reproducing the symptom. Whenever I've
> tried anything, the client always immediately sees the failure. What
> situation would cause the client to hang and not time out? There must
> be a particular location in the cache manager code where that situation
> can occur. (Or is it a case where the server is semi-responding, but
> not enough to cause the client to break loose? If so, I wonder whether
> there is some way to make the client more sensitive and give it a
> greater tendency to drop the server.)
>
> -- Nathan
>
> ------------------------------------------------------------
> Nathan Neulinger                 EMail: [EMAIL PROTECTED]
> University of Missouri - Rolla   Phone: (573) 341-4841
> Computing Services               Fax: (573) 341-4216
> _______________________________________________
> OpenAFS-devel mailing list
> [EMAIL PROTECTED]
> https://lists.openafs.org/mailman/listinfo/openafs-devel
