[cc += ivan]

On 05/31/09 15:37, Tomeu Vizoso wrote:
> On Sun, May 31, 2009 at 15:09, Seth Woodworth <s...@isforinsects.com> wrote:
>> No, I can reach the wiki just fine. I clicked around to a few pages I
>> couldn't possibly have cached on my machine (user profiles, etc).
>
> solarsail was down earlier today, Bernie rebooted it and is fine now.

I dug through the logs looking for the root cause of this outage, but
couldn't find conclusive evidence. It looks like an out-of-memory
condition, where the machine was paging out everything to make room
for a runaway process. But one would wonder why the OOM killer did not
kick in to do its job.

cjb could miraculously log in through ssh and issue an uptime command.
The reported load average was 66 (we have 32 processors, so it's not
that incredible :-)

Where can we put the blame? I'd like to point fingers at trac, but I'm
not sure. The high load seems to imply *many* runaway processes, but
trac runs as a single-threaded application, unless I'm mistaken. The
previous time, I remember seeing plenty of Apache instances in ps.

Anyway, these outages happen about once every 2-3 months, and a forced
reboot seems to fix them. If we can't figure it out this time, we don't
necessarily have to start running around with our hair on fire.

WARNING: the current default kernel (vmlinuz-2.6.24-23-sparc64-smp)
hangs immediately at boot. I had to pass linuxOLD to silo. We're now
running 2.6.24-22-sparc64-smp. This needs further investigation.

-- 
   // Bernie Innocenti - http://codewiz.org/
 \X/  Sugar Labs       - http://sugarlabs.org/
_______________________________________________
IAEP -- It's An Education Project (not a laptop project!)
IAEP@lists.sugarlabs.org
http://lists.sugarlabs.org/listinfo/iaep