On 2009-01-22 14:44, Brian Elliott Finley wrote:
> Andrea,
> 

Hi Brian,

> Do you believe that we're simply running out of memory?  How much memory do 
> you think we need?
> 
> Also, if I could replace the box, would it be OK if I went with a different 
> architecture?  Just considering the options...

Hmm... judging from the logs, the last 3-4 crashes were all caused by
out-of-memory conditions. I don't know the details of this latest outage,
but it doesn't look like the same OOM problem: in an OOM case the server
usually still answers pings, and now it doesn't. Probably something went
wrong during the auto-reboot triggered by the panic_on_oops setting.
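
To make the auto-reboot dependency concrete, here is a minimal Python
sketch (not something running on the server). It assumes the standard
Linux sysctls kernel.panic_on_oops and kernel.panic; the helper names
read_sysctl and reboots_after_oops are made up for illustration.

```python
import os


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl such as 'kernel.panic' from the /proc/sys tree."""
    path = os.path.join(root, *name.split("."))
    with open(path) as f:
        return int(f.read().strip())


def reboots_after_oops(panic_on_oops, panic_timeout):
    """Return True if an oops leads to a panic and the panic leads to a reboot.

    kernel.panic_on_oops=1 turns an oops into a panic. kernel.panic is the
    number of seconds to wait before rebooting; 0 means never reboot.
    """
    return panic_on_oops == 1 and panic_timeout != 0


# Usage on the box itself (reads the live settings, changes nothing):
#   reboots_after_oops(read_sysctl("kernel.panic_on_oops"),
#                      read_sysctl("kernel.panic"))
```

Even if this returns True, it only says the kernel is configured to
reboot, not that the machine actually comes back up afterwards.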

In any case, a server with more memory would probably just delay
the problem rather than resolve it for good. To fix this kind of
issue we must identify all the possible causes and try to prevent
them... the check_oom.pl script reduced the crashes *a lot* (well...
except in the last few days...), and the auto-reboot (panic_on_oops)
seemed like a good idea, but only if the reboot is a reliable operation.
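
For anyone who hasn't seen check_oom.pl, the idea can be sketched
roughly as below. This is a Python illustration, not the actual script:
the threshold, the restart command and all the function names are
assumptions, and the real script may measure memory differently.

```python
import subprocess

# Hypothetical values: the real threshold and restart command are not
# given in this thread.
LOW_MEMORY_KB = 64 * 1024
RESTART_CMD = ["/etc/init.d/apache2", "restart"]


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def reclaimable_kb(info):
    """Estimate how much memory is free or easily reclaimable."""
    if "MemAvailable" in info:  # only reported by newer kernels
        return info["MemAvailable"]
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def memory_is_low(info, threshold_kb=LOW_MEMORY_KB):
    """Return True when the estimate drops below the threshold."""
    return reclaimable_kb(info) < threshold_kb


def check_once(meminfo_path="/proc/meminfo", restart=subprocess.call):
    """Run one check and restart Apache if memory is low.

    Returns True if a restart was requested.
    """
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    if memory_is_low(info):
        restart(RESTART_CMD)
        return True
    return False
```

Run periodically (from cron, for example), check_once() restarts Apache
before the kernel OOM killer has to step in. It can't help when memory
runs out faster than the check interval, which is why the auto-reboot
is still needed as a fallback.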

The error message on the console would really help us understand what
happened.

BTW, a box with IPMI capability, so that we could remotely reboot it or
even look at the console, would be *great*. I've no idea whether it's
possible to get something like that, or whether someone would want to
donate one. I can ask at CINECA if they have some spare boxes with this
capability.

-Andrea

> 
> -Brian
> 
>  
> ------Original Message------
> From: Andrea Righi
> To: supp...@ci.uchicago.edu
> Cc: Sisuite-devel
> Cc: Brian Finley
> Subject: [Sisuite-devel] systemimager.ci.uchicago.edu down
> Sent: Jan 22, 2009 2:54 AM
> 
> Dear supp...@uchicago,
> 
> the host systemimager.ci.uchicago.edu seems to be down (it answers
> neither ping nor telnet).
> 
> In the past few days we've had a lot of out-of-memory problems. We've
> now configured the server to prevent OOM conditions (using a script
> that restarts apache when memory is getting low) and, in case the
> OOM can't be prevented, the kernel automatically reboots after an OOM
> trace. Unfortunately this doesn't seem to be enough...
> 
> Could you please check whether the server is down for another reason
> (not OOM; I mean, if there's a console, what is the message on the
> screen?) and try to reboot it manually?
> 
> Many thanks,
> -Andrea
> 

_______________________________________________
sisuite-devel mailing list
sisuite-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sisuite-devel
