Re: [Sisuite-devel] systemimager.ci.uchicago.edu down

2009-01-25 Thread Andrea Righi
On 2009-01-22 14:44, Brian Elliott Finley wrote:
 Andrea,
 

Hi Brian,

 Do you believe that we're simply running out of memory?  How much memory do 
 you think we need?
 
 Also, if I could replace the box, would it be OK if I went with a different 
 architecture?  Just considering the options...

Hmm... judging from the logs, the last 3-4 crashes were all caused by
out-of-memory conditions. I don't know the details of this latest
outage, but it doesn't look like the same OOM problem: after an OOM the
server is usually still pingable, and now it isn't. Probably something
went wrong during the auto-reboot triggered by the panic_on_oops setting.

In any case, a server with more memory would probably only delay the
problem rather than solve it for good. To fix this kind of issue we
must identify all the possible causes and try to prevent them. The
check_oom.pl script reduced the crashes *a lot* (well... except in the
last few days...), and the auto-reboot (panic_on_oops) seemed like a
good idea, but only if the reboot itself is a reliable operation.
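
As an illustration of the kind of check a watchdog like check_oom.pl
performs, here is a minimal Python sketch. It is not the actual script
(which is not shown in this thread): the 64 MB threshold, the function
names, and the use of "apachectl restart" are assumptions made for the
example.

```python
import subprocess


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> number.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts and are stored as-is.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def free_kb(info):
    """Estimate usable memory in kB.

    Uses MemAvailable when the kernel provides it; otherwise falls back
    to the rough estimate MemFree + Buffers + Cached.
    """
    if "MemAvailable" in info:
        return info["MemAvailable"]
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def memory_is_low(info, min_free_kb):
    """Return True when the estimated usable memory is below the threshold."""
    return free_kb(info) < min_free_kb


def check_once(min_free_kb=65536, restart_cmd=("apachectl", "restart")):
    """Read /proc/meminfo once and restart apache if memory is low.

    Both the threshold and the restart command are hypothetical defaults.
    Returns True if a restart was attempted.
    """
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    if memory_is_low(info, min_free_kb):
        subprocess.run(restart_cmd, check=False)
        return True
    return False
```

Something like check_once() could be run every minute from cron. Note
that this only covers the "prevent the OOM" half; it does nothing about
a reboot that fails to complete.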

The error message on the console would really help us understand what
happened.

BTW, a box with IPMI support, so that we could reboot it remotely or
even look at the console remotely, would be *great*. I've no idea
whether we can get something like that, or whether someone would be
willing to donate one. I can ask at CINECA whether they have any spare
boxes with this capability.

-Andrea

 
 -Brian
 
  
 --Original Message--
 From: Andrea Righi
 To: supp...@ci.uchicago.edu
 Cc: Sisuite-devel
 Cc: Brian Finley
 Subject: [Sisuite-devel] systemimager.ci.uchicago.edu down
 Sent: Jan 22, 2009 2:54 AM
 
 Dear supp...@uchicago,
 
 the host systemimager.ci.uchicago.edu seems to be down (it answers
 neither ping nor telnet).
 
 Over the past few days we've had a lot of out-of-memory problems. We've
 now configured the server to prevent OOM conditions (using a script
 that restarts apache when memory is getting low), and in case the
 OOM can't be prevented the kernel automatically reboots after an OOM
 trace. Unfortunately this doesn't seem to be enough...
 
 Could you please check whether the server is down for some other
 reason (not OOM; if there's a console, what is the message on the
 screen?) and try to reboot it manually?
 
 Many thanks,
 -Andrea
 

___
sisuite-devel mailing list
sisuite-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sisuite-devel


Re: [Sisuite-devel] systemimager.ci.uchicago.edu down!

2007-01-18 Thread Brian Elliott Finley

Sorry it took a bit for me to respond.  Meetings all afternoon yesterday,
then a racquetball travel league match.

I'll contact the local admins at the physical site.

-Brian


On 1/17/07, Andrea Righi [EMAIL PROTECTED] wrote:


Brian,

the machine is reachable via ping and telnet, but the services do not
respond... in practice everything is hanging...

Could you reboot / investigate?

Thanks,
-Andrea






--
Brian Elliott Finley    Phone: 630.631.6621
gpg --keyserver wwwkeys.pgp.net --recv-keys 10F8EE52