Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread Nifty Cluster Mitch
On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote: On Sat, Nov 15, 2008 at 7:26 PM, Vandaman [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: We have a server which locks up about once a week (for the past 3 .. How do I debug the server, which runs CentOS 5.2, to see why it

Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread Rudi Ahlers
On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch [EMAIL PROTECTED] wrote: On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote: On Sat, Nov 15, 2008 at 7:26 PM, Vandaman [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: We have a server which locks up about once a week (for the

Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread Rudi Ahlers
On Thu, Nov 20, 2008 at 10:27 AM, Rudi Ahlers [EMAIL PROTECTED] wrote: On Thu, Nov 20, 2008 at 10:09 AM, Nifty Cluster Mitch [EMAIL PROTECTED] wrote: On Sat, Nov 15, 2008 at 08:13:24PM +0200, Rudi Ahlers wrote: On Sat, Nov 15, 2008 at 7:26 PM, Vandaman [EMAIL PROTECTED] wrote: Rudi Ahlers

Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread John R Pierce
Rudi Ahlers wrote: This is when I realized that the Q9300 CPU could be too big a processor for the fan that I have installed. The fan that I have is: http://www.dynatron-corp.com/products/cpucooler/cpucooler_model.asp?id=165 So, it looks like it's not really made for a Q9300 CPU, although

Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread Rudi Ahlers
On Thu, Nov 20, 2008 at 10:38 AM, John R Pierce [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: This is when I realized that the Q9300 CPU could be too big a processor for the fan that I have installed. The fan that I have is:

Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread Kai Schaetzl
Rudi Ahlers wrote on Thu, 20 Nov 2008 10:30:53 +0200: Top reported load to be 12 - 15, which is normally still workable, but with the overheating CPU, I couldn't do a thing. If it's overheating there should be two things telling you this: - sensors - throttled CPU speed Something you can
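The throttled-speed check mentioned above can be roughed out in a few lines. This is only a sketch, not something posted in the thread: it assumes the kernel exposes a "cpu MHz" line per logical CPU in /proc/cpuinfo, and the function names, the nominal clock value, and the 90% tolerance are all hypothetical choices for illustration.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of each logical CPU found in /proc/cpuinfo text."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


def throttled_cpus(speeds, nominal_mhz, tolerance=0.9):
    """Return indexes of CPUs running below tolerance * nominal_mhz.

    nominal_mhz is an assumption supplied by the caller (the rated clock
    of the processor); tolerance is an arbitrary illustrative threshold.
    """
    return [i for i, mhz in enumerate(speeds) if mhz < nominal_mhz * tolerance]


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the raw cpuinfo text from the running system."""
    with open(path) as handle:
        return handle.read()
```

On the affected machine one could call `throttled_cpus(parse_cpu_mhz(read_cpuinfo()), nominal_mhz)` while the box is under load. A low reading on an idle machine may simply be normal frequency scaling, so it only hints at thermal throttling when the load is high.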

Re: [CentOS] how to debug hardware lockups?

2008-11-20 Thread Rudi Ahlers
On Thu, Nov 20, 2008 at 1:31 PM, Kai Schaetzl [EMAIL PROTECTED] wrote: Rudi Ahlers wrote on Thu, 20 Nov 2008 10:30:53 +0200: Top reported load to be 12 - 15, which is normally still workable, but with the overheating CPU, I couldn't do a thing. If it's overheating there should be two things

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Les Mikesell
Rudi Ahlers wrote: On Sun, Nov 16, 2008 at 1:14 AM, John R Pierce [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: Well, on a standard CentOS 5.2, /var/log/messages will be the place to log problems like this, or where else can I get more info? tough to write to the disk when the kernel is

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Rudi Ahlers
I had a machine that would crash about once every week or two in normal operation. Memtest86+ found an error in the 2nd day of running. The worst part was that it left the raid mirrors in a strange state that caused occasional problems for months even after replacing the RAM. -- Did you

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Les Mikesell
Rudi Ahlers wrote: I had a machine that would crash about once every week or two in normal operation. Memtest86+ found an error in the 2nd day of running. The worst part was that it left the raid mirrors in a strange state that caused occasional problems for months even after replacing the RAM.

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Rob Lines
On Tue, Nov 18, 2008 at 9:47 AM, Les Mikesell [EMAIL PROTECTED] wrote: Did you leave memtest86+ running for 2 days? I thought 1 or 2 cycles would be good enough? I'm hoping to pick up the server in the next 2 hours then I can see what happens when I run memtest86+ or other tests Yes,

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread nate
Les Mikesell wrote: Yes, apparently RAM errors can be subtle and only appear when certain adjacent bit patterns are stored - or when the moon is in a certain phase or something. Don't forget cosmic rays http://adsabs.harvard.edu/abs/1978ITNS...25.1166P nate

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Les Mikesell
nate wrote: Les Mikesell wrote: Yes, apparently RAM errors can be subtle and only appear when certain adjacent bit patterns are stored - or when the moon is in a certain phase or something. Don't forget cosmic rays http://adsabs.harvard.edu/abs/1978ITNS...25.1166P Yeah, but those don't

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Matthew Kent
On Sat, 2008-11-15 at 21:59 +0200, Rudi Ahlers wrote: That machine doesn't have a serial port (why do vendors think serial ports are obsolete?), so is there any other way to send the logs to a different machine then? You can send it to another machine's syslogd with netconsole. Check out
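For illustration only, the receiving side of such a setup can be as small as a UDP listener on the second machine. The sketch below is not from the thread: it assumes the crashing host has been configured to send its kernel messages as plain UDP datagrams to this machine, and the function names and the port number 6666 are placeholders to be matched to whatever the sender actually targets.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming log datagrams.

    The port is a placeholder; use the port the sending machine targets.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_one(sock, bufsize=4096):
    """Block until one datagram arrives; return (sender_address, message_text)."""
    data, (addr, _port) = sock.recvfrom(bufsize)
    return addr, data.decode("utf-8", errors="replace").rstrip("\n")
```

Looping over `receive_one` and appending each message to a file on the healthy machine keeps whatever the kernel managed to send before it stopped responding, which local disk logging cannot guarantee.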

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread Ross Walker
On Nov 18, 2008, at 6:05 PM, Les Mikesell [EMAIL PROTECTED] wrote: nate wrote: Les Mikesell wrote: Yes, apparently RAM errors can be subtle and only appear when certain adjacent bit patterns are stored - or when the moon is in a certain phase or something. Don't forget cosmic rays

Re: [CentOS] how to debug hardware lockups?

2008-11-18 Thread nate
Ross Walker wrote: Ah, memory mapped files, another very good reason to use ECC with large memory machines. Normal ECC doesn't seem to be all that great IMO, though I have been very impressed with HP's Advanced ECC; it seems much more resilient to memory errors. Bad RAM has been my #1 source

[CentOS] how to debug hardware lockups?

2008-11-15 Thread Rudi Ahlers
Hi, We have a server which locks up about once a week (for the past 3 weeks now), without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often software loss as well. How do I debug the server, which runs CentOS 5.2, to see why it locks up?

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread Richard Karhuse
On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers [EMAIL PROTECTED] wrote: Hi, We have a server which locks up about once a week (for the past 3 weeks now), without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often software loss as

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread Rudi Ahlers
On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse [EMAIL PROTECTED] wrote: On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers [EMAIL PROTECTED] wrote: Hi, We have a server which locks up about once a week (for the past 3 weeks now), without any warning, and the only way to recover it is to

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread Vandaman
Rudi Ahlers wrote: We have a server which locks up about once a week (for the past 3 weeks now), without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often software loss as well. How do I debug the server, which runs CentOS 5.2

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread Rudi Ahlers
On Sat, Nov 15, 2008 at 7:26 PM, Vandaman [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: We have a server which locks up about once a week (for the past 3 weeks now), without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread nate
Rudi Ahlers wrote: Unfortunately, I can't leave a monitor attached to the server all the time. The server is in a shared cabinet @ a 3rd party ISP, and they lock the cabinets once we're done working with it. The last lockup was about 6 days ago, and previous one about 8 days ago. There's no

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread Rudi Ahlers
On Sat, Nov 15, 2008 at 8:17 PM, nate [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: Unfortunately, I can't leave a monitor attached to the server all the time. The server is in a shared cabinet @ a 3rd party ISP, and they lock the cabinets once we're done working with it. The last lockup was

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread John R Pierce
Rudi Ahlers wrote: Well, on a standard CentOS 5.2, /var/log/messages will be the place to log problems like this, or where else can I get more info? tough to write to the disk when the kernel is crashing. ditto the network. that leaves VGAs and serial ports, which can be written to

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread Rudi Ahlers
On Sun, Nov 16, 2008 at 1:14 AM, John R Pierce [EMAIL PROTECTED] wrote: Rudi Ahlers wrote: Well, on a standard CentOS 5.2, /var/log/messages will be the place to log problems like this, or where else can I get more info? tough to write to the disk when the kernel is crashing. ditto the

Re: [CentOS] how to debug hardware lockups?

2008-11-15 Thread John R Pierce
Rudi Ahlers wrote: No, the motherboard doesn't support ECC RAM. The motherboard is an Intel DG35EC - http://www.intel.com/products/desktop/motherboards/DG35EC/DG35EC-overview.htm midrange business desktop board. I use a DG33TL as my desktop, same thing.