On Wed, 2008-10-01 at 16:36 -0400, Stephen Clark wrote:
> Robert Watson wrote:
> > On Wed, 1 Oct 2008, Gary Palmer wrote:
> >
> >> "ps alxw" may be of interest in addition to "ps auxw" as it displays
> >> what the processes are waiting on. It could conceivably be a problem
> >> of some kind at the filesystem level. I've seen situations before
> >> where a problem escalates to the point where "ls /" hangs, and at that
> >> point you're stuck with an unresponsive box.
> >
> > If you want an even greater level of detail than ps -l, you can use
> > procstat -k to generate kernel stack traces for all user/kernel
> > threads. Wait channels are very useful, but they only tell you what the
> > code that invoked the wait thinks it is for, not how that code was
> > reached. A classic example is waiting on an exhausted UMA zone -- you
> > get a uma wait channel, but no indication of what subsystem performed
> > the memory allocation... This requires FreeBSD 7.1 or higher,
> > however. (Obviously, the same can be done easily using DDB, but that's
> > hard on a box without a serial console, and requires interrupting the
> > flow of the operating system, compiling with DDB, etc).
> >
> > Robert N M Watson
> > Computer Laboratory
> > University of Cambridge
> >
> A big part of the problem is that this seems to take about 100 days of
> uptime to occur. We have some in-house test boxes but have never seen
> the problem, probably because none of them have been up more than about
> 45 days. The units in the field, of which there are about 300, are
> headless and none are physically close.
>
> When the boxes are rebooted there are no error messages in any of the
> log files, only the absence of information that would normally be logged
> by new processes that would be spawned. We are getting ready to install
> a patch that will try to gather more information.
>
> I thought about writing an app that would try to fork a child
> periodically and record in a log file if there was an error. But EAGAIN
> is nonspecific as to the real reason the fork failed. I was looking for
> some way to periodically log the resources that would cause the fork
> failure.
>
> procstat -k looks like it would have been a good candidate but
> unfortunately we are running 6.1.
>
> Thanks for the response.
> Steve
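
A minimal sketch, in C, of the periodic fork probe described in the quoted
text above. It only shows the idea under stated assumptions: the function
names probe_fork_once and probe_fork_loop are hypothetical, the log line
format is made up for illustration, and nothing here comes from the actual
patch being prepared. It records the errno from a failed fork() with a
timestamp; it does not identify which resource limit was hit.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: attempt one fork() and log a failure to 'log'.
 * Returns 0 if fork() succeeded, otherwise the errno it reported. */
int probe_fork_once(FILE *log)
{
    assert(log != NULL);

    /* Flush first so the child does not inherit buffered, unwritten output. */
    fflush(log);

    pid_t pid = fork();
    if (pid == -1) {
        int err = errno;  /* save before any other call can change it */
        fprintf(log, "%ld fork failed: errno=%d (%s)\n",
                (long)time(NULL), err, strerror(err));
        fflush(log);
        return err;
    }

    if (pid == 0)
        _exit(0);  /* child: exit at once without running atexit handlers */

    /* Parent: reap the child so the probe itself leaves no zombies. */
    int status;
    while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
        ;
    return 0;
}

/* Hypothetical driver: probe 'count' times, sleeping 'interval' seconds
 * between attempts. Returns the number of failed fork() attempts. */
int probe_fork_loop(FILE *log, unsigned interval, int count)
{
    int failures = 0;

    for (int i = 0; i < count; i++) {
        if (probe_fork_once(log) != 0)
            failures++;
        if (i + 1 < count)
            sleep(interval);
    }
    return failures;
}
```

A small main() could open a log file in append mode and call
probe_fork_loop(log, 60, n) for some n. Since EAGAIN alone does not say why
the fork failed, the log line would probably need to be paired with
something like periodic "ps alxw" output captured at the same time.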

I have a VIA EPIA-based system that was rebooting and not leaving behind
any diagnosable evidence that I could find. Attaching a serial console
revealed a kernel trap which was double-faulting as it went to write the
kernel dump. I've not yet had the opportunity to investigate further,
except that out of desperation I threw in an additional 64M of RAM - all I
had to hand - adding to its 256M, and I haven't seen it fault again in the
37 days since (it would often stay up for less than a day before that).

I wonder whether it would be worth your while running a bench unit with
limited RAM, either physically or via the hw.physmem tunable. I would
probably try to identify the amount of RAM that just allows it to run
"normally", ideally subjecting it to a typical workload if possible. If it
bombs after running for a reasonable length of time, add back a fraction
of the unused memory and see if it then stays up proportionally longer,
which could be indicative of a memory starvation issue. If you can get it
to bomb in the above scenario then you can probably get some insight into
where it's failing.

Having said that, I should point out that I've not actually used the above
technique, so I may well be overlooking something which might prevent it
from being useful or indeed from "working" at all.

Wayne
