Gregory K. Ruiz-Ade wrote:
So, in a similar vein to the other hardware debugging issues that have been discussed here, does anyone have recommendations for debugging when a system spontaneously reboots?

The basic result is as if someone hit the reset button. Now, unless the reset button on the case is actually triggering somehow (why would it?), what could be causing this?

It's driving me absolutely batty, because it started on Tuesday, apparently in the morning:

[EMAIL PROTECTED](pts/2):~ 4 > last reboot
reboot   system boot  2.6.20-16-generi Wed Oct  3 07:20 - 11:13  (03:53)
reboot   system boot  2.6.20-16-generi Wed Oct  3 00:56 - 00:57  (00:00)
reboot   system boot  2.6.20-16-generi Wed Oct  3 00:47 - 00:54  (00:06)
reboot   system boot  2.6.20-16-generi Wed Oct  3 00:31 - 00:31  (00:00)
reboot   system boot  2.6.20-16-generi Tue Oct  2 23:30 - 00:31  (01:01)
reboot   system boot  2.6.20-16-generi Tue Oct  2 23:19 - 23:29  (00:09)
reboot   system boot  2.6.20-16-generi Tue Oct  2 21:24 - 23:29  (02:05)
reboot   system boot  2.6.20-16-generi Tue Oct  2 11:47 - 23:29  (11:42)
reboot   system boot  2.6.20-16-generi Tue Oct  2 10:46 - 23:29  (12:42)
reboot   system boot  2.6.20-16-generi Tue Oct  2 09:46 - 23:29  (13:42)
reboot   system boot  2.6.20-16-generi Tue Oct  2 08:46 - 23:29  (14:43)
reboot   system boot  2.6.20-16-generi Tue Oct  2 07:46 - 23:29  (15:43)

I've checked last month's wtmp for reboot entries, and there's nothing aside from when I know I've restarted the system.
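One way to look for a pattern is to pull the boot times out of the `last reboot` output and see how far apart they are. In the listing above, the Tuesday morning boots (07:46, 08:46, 09:46, 10:46) are an hour apart, which may or may not point to something periodic. The following is only a minimal sketch: the function names are made up for illustration, and the year has to be supplied because `last` does not print it.

```python
import re
from datetime import datetime

# Matches the boot timestamp in a `last reboot` line such as
# "reboot   system boot  2.6.20-16-generi Wed Oct  3 07:20 - 11:13  (03:53)".
BOOT_RE = re.compile(
    r"^reboot\s+system boot\s+\S+\s+\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2})"
)

def boot_times(lines, year):
    """Return the boot datetimes found in `lines`, oldest first."""
    times = []
    for line in lines:
        m = BOOT_RE.match(line)
        if m is None:
            continue  # skip blank lines and the "wtmp begins" trailer
        mon, day, hh, mm = m.groups()
        stamp = f"{year} {mon} {day} {hh}:{mm}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M"))
    return sorted(times)

def gaps_minutes(times):
    """Minutes between each boot and the next one."""
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]
```

Feeding it the lines above with `year=2007` and printing `gaps_minutes(boot_times(lines, 2007))` makes the spacing between boots easy to compare against cron schedules or recording times.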

Numerous things can make a system reboot, but the most common one I have come across is a power supply problem. Even though the system is new, all it takes is one little burp from the power supply to cause a reboot. Even a brand-name power supply is suspect.

Troubleshooting the power supply requires that you either borrow or make a power supply monitor. The monitor needs to check the +5 VDC, +3.3 VDC and +12 VDC lines. I don't think anything uses the -12 VDC line any more, except maybe the RS-232 serial port. The power monitor is basically a simple comparator that switches, and stays switched, when the voltage drops below a preset value. I think about two dollars' worth of parts could make one, not counting the connector cable.
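The "switch and stay switched" behaviour is the important part: a brief dip must remain visible after the voltage has recovered. The logic can be modelled in a few lines. This is only a sketch of the idea, not a design for the actual circuit; the class name is hypothetical, and the trip points are an assumption (nominal voltage minus the 5% tolerance usually quoted for the positive ATX rails).

```python
# Assumed trip points: nominal rail voltage minus 5%.
THRESHOLDS = {"+3.3V": 3.135, "+5V": 4.75, "+12V": 11.40}

class LatchingMonitor:
    """Software model of a latching under-voltage comparator."""

    def __init__(self, thresholds=THRESHOLDS):
        self.thresholds = dict(thresholds)
        self.tripped = {}  # rail name -> first voltage that tripped it

    def sample(self, rail, volts):
        """Feed one reading; once a rail has tripped it stays tripped."""
        if rail not in self.tripped and volts < self.thresholds[rail]:
            self.tripped[rail] = volts

    def reset(self):
        """Clear all latches, like pressing a reset button on the monitor."""
        self.tripped.clear()
```

A sequence of +5V readings such as 5.02, 4.60, 5.01 leaves the +5V latch set even though the last reading is back in range, which is exactly what you want to find after a reboot.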

To top it all off, this is my MythTV system, and these antics have interfered with recording schedules that we've pulled off of the TiVo (guess I visit the iTunes store?).

I've pulled the machine (new everything, all purchased, assembled and installed ~1 month ago) out of the cabinet, cleaned all the dust off the air intake screens, and adjusted the intake fans up a notch (from "low" to "medium"; I love these Antec Tri-Cool fans) to increase airflow through the case. I also went into the BIOS and told it to be more aggressive with the control of the CPU cooler's fan.

While I was there, I checked the temps. Everything was ~30-33 C, but the MCH and ICH zones were 67 and 66 C, respectively. Are these the CPU core temps? This is a Q6600 (I think) Core 2 Quad 2.4 GHz part. Am I not getting enough cooling? The heatsink fins are warm to the touch, but not unbearably hot.

The temps are for the CPU and come from an on-die thermal sensor. These temperatures are normal for these chips. See the Intel Thermal Design Limit documents: <http://www.intel.com/technology/itj/2006/volume10issue02/art03_Power_and_Thermal_Management/p03_power_management.htm>
<http://www.intel.com/cd/channel/reseller/asmo-na/eng/299986.htm>

I tried to install lm-sensors, but sensors-detect detected no sensors. It's possible the latest code from upstream might support my motherboard (Intel DP35DPM), but what's included with Ubuntu 7.04 does not. Didn't have time to investigate further.

lm_sensors goes by chipset, not motherboard. Check for the Intel P35 chipset.
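If a kernel driver for the monitoring chip does load, its readings also show up under sysfs, so they can be logged without the `sensors` tool. The sketch below assumes the usual hwmon layout (`temp*_input` files holding millidegrees Celsius, either directly under `hwmon*` or under `hwmon*/device` on older kernels); the function name is made up, and nothing appears there until a driver is bound.

```python
from pathlib import Path

def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {relative sensor path: degrees C} for every readable sensor."""
    base = Path(root)
    temps = {}
    # Newer kernels put the files in hwmonN/, older ones in hwmonN/device/.
    for pattern in ("hwmon*/temp*_input", "hwmon*/device/temp*_input"):
        for sensor in sorted(base.glob(pattern)):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            temps[str(sensor.relative_to(base))] = millideg / 1000.0
    return temps
```

An empty result simply means no hwmon driver is loaded, which matches what sensors-detect reported.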

Is my power supply glitched? This particular case orients the PSU so that its fan draws air from directly outside the case via its own intake screen... This was caked with dust (hooray SoCal air!), which I wiped off the screen.

Cleanliness is next to godliness... I think I fail this one, but my computers are clean.

Do modern PSUs have a thermal cut-out that prevents them from powering up the box, or cuts power to a running system, if it's too hot?

I would hope so. Although very few manufacturers will tell you how their power supplies work, PC Power and Cooling <http://www.pcpower.com/technology/> gives a fairly good explanation.

I ran a memtest86+ overnight, and the RAM came up clean.

The vendor did a burn-in test on the CPU/Mobo/RAM combo before shipping it. It's been running non-stop for over a month without problem...

Might try pushing the CPU(s) a little more. Try the StressCPU program from here: <http://www.gromacs.org/component/option,com_docman/task,cat_view/gid,24/Itemid,26/> listed at the bottom of the page.

Do I just need to do a monthly dusting & blow-out? (Or move into a hermetically sealed house?)

Wouldn't hurt. Although be aware of ear problems when going through the air lock :0

Frustrating.

Intermittent problems are always harder to fix than a complete failure. It would be nice to have a log of everything running over, say, a 24-hour period before a spontaneous reboot. For example, run a cron job every minute to record all running processes and memory usage. Have the job call sync to flush the buffers to disk or, even better, log everything remotely.
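A rough sketch of such a job follows. The function names and log path are hypothetical, and it assumes a Linux box with procps `ps` and `/proc/meminfo`. Each run appends one snapshot and forces it to disk, so the last entry before a reset should survive.

```python
import os
import subprocess
import time

def collect_ps():
    """Process list from ps; returns a note instead if ps is unavailable."""
    try:
        result = subprocess.run(
            ["ps", "-eo", "pid,ppid,pcpu,pmem,etime,args"],
            capture_output=True, text=True, check=False,
        )
        return result.stdout
    except OSError as exc:
        return f"(ps failed: {exc})\n"

def collect_meminfo(path="/proc/meminfo"):
    """Kernel memory statistics as plain text."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        return f"(meminfo unavailable: {exc})\n"

def format_snapshot(ps_text, mem_text, now=None):
    """Combine both reports under a timestamp header."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"===== {stamp} =====\n{mem_text}\n{ps_text}\n"

def append_snapshot(log_path, text):
    """Append to the log and push it out of the buffer cache."""
    with open(log_path, "a") as log:
        log.write(text)
        log.flush()
        os.fsync(log.fileno())  # flush this file to disk
    if hasattr(os, "sync"):
        os.sync()               # same effect as running sync(1)

def main(log_path="/var/log/reboot-hunt.log"):  # hypothetical path
    append_snapshot(log_path, format_snapshot(collect_ps(), collect_meminfo()))
```

Saved as, say, `/usr/local/bin/snapshot.py` with a final `main()` call, a crontab line like `* * * * * /usr/bin/python3 /usr/local/bin/snapshot.py` would run it every minute. Pointing `log_path` at a network mount would be one crude way to get the log off the box.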

One thing to try is to set the BIOS to leave the computer off on power failure. Right now you can't distinguish between a reboot caused by power interruption vs a reset signal/hardware/software fault. Maybe you'll get lucky and find that the power is glitching and not the computer. I had this happen once when my neighbor was using an arc welder for about a week every night around 8 pm. Alternatively, see if you can borrow a UPS if you don't have one and see if it helps.

Gus


--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
