I'm having a problem with spontaneous restarts.  This isn't a new problem,
but I've done the obvious things and the problem hasn't gone away.  I
was thinking of asking on -hackers, but I'm trying here first.

The system is FreeBSD 4.8 with a mix of patches and port upgrades of various
ages.  I'm planning to rebuild the whole thing, bringing it up to date,
but I'm hoping to be able to wait for a 5.x in STABLE; I don't want to do
this twice, since I expect I'll have to dump and restore everything.

The hardware is a 2.6 GHz P4 with 2 GByte of GEIL dual-channel memory.
(The problem existed on the previous, somewhat slower, memory as well.)
The box contains the processor and motherboard (Gigabyte GA-SINXP1394),
two floppy drives, CD and CD/W drives, an HP DAT, three IBM/Hitachi
36G/10K SCSI drives, and one 120G IDE.  The SCSI card is by Adaptec; the
video card is a low-end NVidia, and I'm running their video driver.  The
PS is an Antec True380, which should be enough for the box, with something
to spare.  There are several extra, large fans, of which more later.

The system, monitor, printer, and cable modem are all powered through an
APC BACK-UPS 450, about 18 months old.  In the last week it has shown that
it can keep things up for more than an hour.

The symptom is a restart that leaves no indication of how it happened.

  Recently, the system shut down (completely, and at the power supply)
  instead of restarting.  In that case, the last deliberate shutdown
  was a `shutdown -h now'; it appears that in every other case, the last
  deliberate shutdown was a `-r now'.  (Question: does the machine
  architecture have settings for reset-resume vs. reset-halt, settings
  that might be remembered when a later action occurs?)  It has
  subsequently shut down with an immediate restart.

There are no failure indications in the /var/log/messages, nor reported
by dmesg.  (The console scrolls by very quickly.)  The message sequence
over the restart typically looks like this:

Jun  7 18:39:09 moleend /kernel: arp: moved from 00:05:00:e7:17:44 to 00:05:00:e7:17:57 on em0
Jun  7 18:39:09 moleend /kernel: arp: moved from 00:05:00:e7:17:57 to 00:05:00:e7:17:44 on em0
Jun  7 18:59:06 moleend dhclient: New Network Number:
Jun  7 18:59:06 moleend dhclient: New Broadcast Address:
Jun  7 22:47:33 moleend /kernel: Copyright (c) 1992-2003 The FreeBSD Project.
Jun  7 22:47:33 moleend /kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
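
Since the kernel's copyright banner is the first thing logged after each
restart, one way to list every restart and the last message logged before it
is to scan the log for that banner.  What follows is only a minimal sketch in
Python, not something that has been run on this machine; the function name
find_restarts and the choice of marker string are illustrative assumptions,
and it assumes the usual syslog timestamp prefix shown in the excerpt.

```python
import re

# Assumed marker: the first line the kernel logs on boot, as in the excerpt.
BOOT_MARKER = "Copyright (c) 1992-"
# Syslog timestamp prefix, e.g. "Jun  7 18:39:09".
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s")


def find_restarts(lines):
    """Return (last_stamp_before_boot, boot_stamp) pairs, one per boot banner."""
    restarts = []
    previous = None  # timestamp of the most recent stamped line seen so far
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # wrapped or continuation line without a timestamp
        stamp = match.group(1)
        if BOOT_MARKER in line:
            restarts.append((previous, stamp))
        previous = stamp
    return restarts
```

Calling find_restarts(open("/var/log/messages")) would return one pair per
boot, so the gap between the two timestamps (here 18:59:06 to 22:47:33) can
be compared against what the machine was doing at the time.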

The restart most often occurs AFTER X has been shut down (and often
restarted) but sometimes when X has not been run.  It most often occurs
when the system is under heavy CPU load, but sometimes when the load
has been light.

I thought at one time it might be a thermal problem and undertook to
fix that.  (I am still working to get more cooling air over the disks.)
Right now, I have 120 mm fans rated at 130-135 CFM (Panaflo and JMC)
pushing air in and out of the box, and pressurizing a duct feeding the
CPU cooler, which is now cool to the touch.  The memory modules are cool
to the touch.  While the disks need a proper plenum to route more air
over them, I no longer believe that there is a thermal problem.  The
video card's fan-blown heatsink is warm (not hot) to the touch; the
northbridge's fan-blown heatsink is warm (not hot) to the touch.

(Some people commute to white-collar jobs in heavy pickups; I drive a
small server as my PC.  No chrome pipes.)

So: what should I do next?  Should I set the system up to go to the
kernel debugger on panic, or even start it via the kernel debugger?
(Where is the full documentation?)  Should I shell out for an even
bigger power supply?  Is there another log that I should examine?
A restart wire that I should check?  A power bus I should scope?
(I'll have to borrow a scope somewhere.)  Is it time for an exorcist?

Thanks for your help.

    Mark Terribile
