On 23.06.2011 17:44, Andrew Deason wrote:
> On Mon, 20 Jun 2011 16:19:01 -0700
> Jonathan Nilsson <[email protected]> wrote:
>
>> i suspect that the system kept spawning httpd processes as old ones
>> got blocked and eventually it ran out of memory and became
>> unresponsive. after a reboot it works fine. so the question is, what
>> caused the afs cache manager to respond so slow?
>>
>> can anyone confirm if they have seen kernel messages like this? how
>> can i confirm if the problem is with the client or the server? i see
>> no error messages in BosLog, FileLog, or VolserLog on our servers...
>
> If the processes were hanging forever or for a very long time, it's not
> likely to be the fault of any server, since the client doesn't wait
> around forever for a response. I assume there were no messages about
> losing contact with file or vl servers in the client logs around that
> time?
>
> It's easier to see what's going on if we know what's going on with the
> rest of the system when that happens. If you ever catch it doing that,
> running 'echo t > /proc/sysrq-trigger' will generate a lot of info (some
> of it useful) in syslog. Or if you can get the machine to dump core,
> that's the most useful thing, but you don't want to just go giving that
> out to anybody.
>

We see these kinds of ghosts from time to time on our web servers (RHEL5,
VMware), but we don't get the kernel messages, only the usual 'lost
contact with file server' messages. It's of course not a problem with
the server, because we can cd into the path on other client systems.

Today we had a load of about 250 on a 2-CPU VM. Running 'fs flushvolume'
from the root along the tree of data for this web server fixed the
problem.

What happens is that Apache tries to deliver a file and runs into AFS
timeouts. This happens to one process after the other, and one new
process after the other is forked until the internal limit of 256
Apache instances is reached.

We don't know how to debug the problem yet. On some machines we're
running up to 30 Apache instances for smaller websites. Each instance is
wrapped with kauth and a separate srvtab to isolate the Apaches from
each other in AFS, and each Apache uses a different user. I don't know
whether this causes problems with the token handling in the kernel. The
kernel keyring size has not been a problem since RHEL 5.5.

Regards
Berthold Cogel

-- 
Dipl. Chem. Dr. Berthold Cogel       University of Cologne
E-Mail: [email protected]             Regionales Rechenzentrum (RRZK)
Tel.: +49(0)221/470-7873             Robert-Koch-Str. 10
FAX:  +49(0)221/478-86845            D-50931 Cologne - Germany
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
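
For reference, the 'fs flushvolume' workaround mentioned above could be
scripted roughly as follows. This is only a sketch under assumptions: it
reads "from the root along the tree" as "every directory from a starting
root down to the web server's data directory", the function name
flush_along_path and the FS_CMD override are made up for illustration,
and the example.org paths are placeholders. Only 'fs flushvolume -path'
itself is the real OpenAFS command.

```shell
#!/bin/sh
# Sketch: run 'fs flushvolume' on each directory from a starting root
# down to a target directory. FS_CMD is a hypothetical override so the
# walk can be dry-run with 'echo' on a machine without AFS.
FS_CMD=${FS_CMD:-fs}

flush_along_path() {
    root=${1%/}      # e.g. /afs/example.org (placeholder cell)
    target=${2%/}    # e.g. /afs/example.org/www/site1 (placeholder)

    # Refuse an empty root and any target that is not at or below root.
    [ -n "$root" ] || { echo "root must not be / or empty" >&2; return 1; }
    case "$target/" in
        "$root"/*) ;;
        *) echo "target is not below root" >&2; return 1 ;;
    esac

    # Split the part of the path below root into its components.
    rel=${target#"$root"}
    old_ifs=$IFS
    set -f                  # no globbing while splitting
    IFS=/
    set -- $rel
    IFS=$old_ifs
    set +f

    # Flush the root first, then each directory on the way down.
    current=$root
    $FS_CMD flushvolume -path "$current"
    for part in "$@"; do
        [ -n "$part" ] || continue   # skip the empty leading field
        current=$current/$part
        $FS_CMD flushvolume -path "$current"
    done
}
```

A dry run such as 'FS_CMD=echo; flush_along_path /afs/example.org
/afs/example.org/www/site1' only prints the commands it would run, which
may be worth checking before pointing it at a real tree. Flushing
discards cached data for the affected volumes, so this treats the
symptom; it does not explain why the cache manager got stuck.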
