Gregory K. Ruiz-Ade wrote:
So, in a similar vein to the other hardware debugging issues that have
been discussed here, does anyone have recommendations for debugging when
a system spontaneously reboots?
The basic result is as if someone hit the reset button. Now, unless the
reset button on the case is actually triggering somehow (why would it?),
what could be causing this?
It's driving me absolutely batty, because it started on Tuesday,
apparently in the morning:
[EMAIL PROTECTED](pts/2):~ 4 > last reboot
reboot system boot 2.6.20-16-generi Wed Oct 3 07:20 - 11:13 (03:53)
reboot system boot 2.6.20-16-generi Wed Oct 3 00:56 - 00:57 (00:00)
reboot system boot 2.6.20-16-generi Wed Oct 3 00:47 - 00:54 (00:06)
reboot system boot 2.6.20-16-generi Wed Oct 3 00:31 - 00:31 (00:00)
reboot system boot 2.6.20-16-generi Tue Oct 2 23:30 - 00:31 (01:01)
reboot system boot 2.6.20-16-generi Tue Oct 2 23:19 - 23:29 (00:09)
reboot system boot 2.6.20-16-generi Tue Oct 2 21:24 - 23:29 (02:05)
reboot system boot 2.6.20-16-generi Tue Oct 2 11:47 - 23:29 (11:42)
reboot system boot 2.6.20-16-generi Tue Oct 2 10:46 - 23:29 (12:42)
reboot system boot 2.6.20-16-generi Tue Oct 2 09:46 - 23:29 (13:42)
reboot system boot 2.6.20-16-generi Tue Oct 2 08:46 - 23:29 (14:43)
reboot system boot 2.6.20-16-generi Tue Oct 2 07:46 - 23:29 (15:43)
I've checked last month's wtmp for reboot entries, and there's nothing
aside from when I know I've restarted the system.
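For a longer wtmp history it can help to pick out the very short sessions automatically. The following is only a sketch: it assumes the `last reboot` line layout shown above, with the uptime in parentheses at the end of each line, and the names `session_minutes` and `short_sessions` as well as the 10-minute cutoff are made up for illustration.

```python
import re

# Matches the trailing "(HH:MM)" or "(D+HH:MM)" duration that `last` prints.
DURATION = re.compile(r"\((?:(\d+)\+)?(\d{1,2}):(\d{2})\)\s*$")


def session_minutes(line):
    """Return the session length in minutes, or None if no duration is present."""
    match = DURATION.search(line)
    if match is None:
        return None  # e.g. a "still running" entry
    days, hours, minutes = match.groups()
    return (int(days or 0) * 24 + int(hours)) * 60 + int(minutes)


def short_sessions(lines, max_minutes=10):
    """Return the lines whose session lasted less than max_minutes."""
    result = []
    for line in lines:
        minutes = session_minutes(line)
        if minutes is not None and minutes < max_minutes:
            result.append(line)
    return result
```

Feeding it the output of `last reboot` (one line per list element) leaves only the boots that died within a few minutes, which are the ones worth comparing against the system logs.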
There are numerous things that can make a system reboot but the most
common one I have come across is power supply problems. Despite the fact
that you have a new system, all it takes is one little burp from the
power supply to cause the reboot. Even if you have a brand name power
supply, it is suspect.
Troubleshooting the power supply requires that you either borrow or make
a power supply monitor. The monitor needs to check the +5VDC, the
+3.3VDC and +12VDC lines. I don't think anything uses the -12VDC line
any more except maybe the RS-232 serial port. The power monitor is
basically a simple comparator that will switch and stay switched when
the voltage drops below a preset value. I think about two dollars' worth
of parts could make one, not counting the connector cable.
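To make the "switch and stay switched" behavior concrete, here is a small software model of such a latching comparator. It is a sketch of the logic only, not a circuit design: the class name is invented, and the trip points (5% below nominal on each rail) are an assumption for illustration, not values taken from any particular power supply.

```python
class LatchingUndervoltageMonitor:
    """Trips when a sample falls below the threshold and stays tripped until reset."""

    def __init__(self, threshold_volts):
        self.threshold_volts = threshold_volts
        self.tripped = False

    def sample(self, volts):
        # Once tripped, later good readings do not clear the latch.
        if volts < self.threshold_volts:
            self.tripped = True
        return self.tripped

    def reset(self):
        self.tripped = False


# Assumed trip points: 5% below nominal for each monitored rail.
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}
monitors = {
    name: LatchingUndervoltageMonitor(nominal * 0.95)
    for name, nominal in NOMINAL_RAILS.items()
}
```

The latch is the important part: a dip lasting a few milliseconds is long gone by the time anyone looks, so the monitor has to remember that it happened.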
To top it all off, this is my MythTV system, and these antics have
interfered with recording schedules that we've pulled off of the TiVo
(guess I visit the iTunes store?).
I've pulled the machine (new everything, all purchased, assembled and
installed ~1 month ago) out of the cabinet, cleaned all the dust off the
air intake screens, adjusted the intake fans up a notch (from "low" to
"medium"; I love these Antec Tri-Cool fans) to increase airflow through
the case. I went into the BIOS and told it to be more aggressive with
the control of the CPU cooler's fan.
While I was there, I checked the temps. Everything was ~30-33 C, but
MCH and ICH zones were 67 and 66 C, respectively. Are these the CPU
core temps? This is a Q6600 (I think) Core 2 Quad 2.4 GHz part. Am I
not getting enough cooling? The heatsink fins are warm to touch, but
not unbearably hot.
The temps are for the CPU and come from an on-die thermal sensor.
These temperatures are normal for these chips. See the Intel Thermal
Design Limit documents
<http://www.intel.com/technology/itj/2006/volume10issue02/art03_Power_and_Thermal_Management/p03_power_management.htm>
<http://www.intel.com/cd/channel/reseller/asmo-na/eng/299986.htm>
I tried to install lm-sensors, but sensors-detect detected no sensors.
It's possible the latest code from upstream might support my motherboard
(Intel DP35DPM), but what's included with Ubuntu 7.04 does not. Didn't
have time to investigate further.
lm_sensors goes by chipset, not motherboard. Check for the Intel P35
chipset.
Is my power supply glitched? This particular case orients the PSU so
that its fan draws air from directly outside the case via its own
intake screen... This was caked with dust (hooray SoCal air!), which I
wiped off the screen.
Cleanliness is next to godliness... I think I fail this one, but my
computers are clean.
Do modern PSUs have a thermal cut-out that
prevents them from powering up the box or cuts power to a running system
if it's too hot?
I would hope so. Although very few manufacturers will tell you how their
power supplies work, PC Power and Cooling
<http://www.pcpower.com/technology/> gives a fairly good explanation.
I ran a memtest86+ overnight, and the RAM came up clean.
The vendor did a burn-in test on the CPU/Mobo/RAM combo before shipping
it. It's been running non-stop for over a month without problem...
Might try pushing the CPU(s) a little more. Try the StressCPU program
from here:
<http://www.gromacs.org/component/option,com_docman/task,cat_view/gid,24/Itemid,26/>
listed at the bottom of the page.
Do I just need to do a monthly dusting & blow-out? (Or move into a
hermetically sealed house?)
Wouldn't hurt. Although be aware of ear problems when going through the
air lock :0
Frustrating.
Intermittent problems are always harder to fix than a complete failure.
It would be nice if you could have a log of everything running over, say,
a 24-hour period before a spontaneous reboot. For example, run a cron job
every minute to document all running processes and monitor memory. Have
the job call sync to flush the buffers to disk, or even better log
everything remotely.
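A rough sketch of what such a cron job could run is below. It assumes a Linux /proc filesystem, and the function names and the log path mentioned afterwards are placeholders rather than anything from an existing tool. Each snapshot is appended and then forced to disk with fsync so that the last entry survives a sudden reset.

```python
import os
import time


def list_processes(proc_root="/proc"):
    """Return sorted (pid, command line) pairs read from /proc (Linux only)."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were looking
        # Arguments are NUL-separated; kernel threads have an empty cmdline.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        procs.append((int(entry), cmdline))
    return sorted(procs)


def read_meminfo(path="/proc/meminfo", keys=("MemTotal", "MemFree", "SwapFree")):
    """Return the selected /proc/meminfo fields as a dict of strings."""
    wanted = {}
    with open(path) as f:
        for line in f:
            name, _, value = line.partition(":")
            if name in keys:
                wanted[name] = value.strip()
    return wanted


def format_snapshot(timestamp, meminfo, processes):
    """Build one text block: header, memory fields, then one line per process."""
    lines = ["=== %s ===" % timestamp]
    lines += ["%s: %s" % (key, meminfo[key]) for key in sorted(meminfo)]
    lines += ["%d %s" % (pid, cmd) for pid, cmd in processes]
    return "\n".join(lines) + "\n"


def append_and_sync(path, text):
    """Append text to the log and force it out to disk before returning."""
    with open(path, "a") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())


def log_snapshot(log_path):
    """Take one snapshot of memory and processes and append it to log_path."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    append_and_sync(log_path, format_snapshot(stamp, read_meminfo(), list_processes()))
```

A script that calls `log_snapshot("/var/tmp/reboot-watch.log")` (path chosen arbitrarily) could be run from a `* * * * *` crontab entry. Logging to a remote machine instead would avoid relying on the local disk at all.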
One thing to try is to set the BIOS to leave the computer off on power
failure. Right now you can't distinguish between a reboot caused by
power interruption vs a reset signal/hardware/software fault. Maybe
you'll get lucky and find that the power is glitching and not the
computer. I had this happen once when my neighbor was using an arc
welder for about a week every night around 8 pm. Alternatively, see if
you can borrow a UPS if you don't have one and see if it helps.
Gus