> I've not see this type of problem before, so I
> turn to you guys.  Is this a sign that maybe
> a drive is going bad?  Or sign of bad memory?
> 
> What's going on here!?  I know it is almost
> Halloween and all, but this is kinda _spooky_
> to say the least.
> 
> 
> Idea? Please? :-)

Hard drives contain lots of moving parts, a known reliability risk.
Therefore most if not all modern hard disks and associated logic
contain more or less elaborate internal self-checking logic to detect
failing media, failing spindle motor, failing head positioning
mechanism, over and under voltage, bus driver failure, etc.  Most of
these will result in kernel messages and/or other obvious signs of
system distress.  Your "dmesg" (assuming it was done after the failed
build) doesn't show any evidence of such a problem, so there's no
reason to suspect a hard disk going bad.
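
If you'd rather not eyeball a long kernel log, something along these
lines could do a first pass.  This is only a rough sketch: the pattern
list is my own guess at typical wording, not an authoritative catalogue,
and the function name is made up for illustration.  Real messages vary
by operating system, kernel version, and driver, so an empty result is
not proof of a healthy disk.

```python
import re

# Illustrative patterns only (an assumption, not a complete list);
# actual kernel messages differ between systems and drivers.
SUSPECT_PATTERNS = [
    r"i/o error",
    r"medium error",
    r"unrecovered read error",
    r"hard error",
    r"bad sector",
]


def find_disk_errors(log_text):
    """Return the log lines matching any suspect pattern, ignoring case."""
    regex = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)
    return [line for line in log_text.splitlines() if regex.search(line)]
```

To use it, save the output of "dmesg" to a file and pass the file's
contents to find_disk_errors(); anything it returns deserves a closer
look by hand.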

More likely possibilities are bad memory, a bad motherboard,
incompatible memory, a bad disk controller, mis-configured bus speeds,
an environmental problem, or, less likely, a bad CPU.  Memory
is simple: if you buy a "consumer grade" home machine, you get memory
that has no self-check logic.  A chip going bad could well produce the
problems you show below.  A "server class" machine will nearly always
contain ECC memory.  A few companies (Dell, Sun) also make "commercial
grade" desktop machines, which usually also contain ECC.  Note that
most "home computer" stores and even many professionals don't understand
or value ECC memory, and will steer you away from such technology.

If it's memory, then even without self-check logic it may still be easy
to see that it's broken.  "memtest86+" has a good reputation.  This is
a stand-alone program, which you can leave running overnight.  If it
fails memtest86+, then the problem is obvious.  If it passes, the
memory is still not in the clear; for instance, it's in theory possible
for the memory to fail when accessed by DMA but not by the processor.
If you can get the memory to fail more or less predictably, and you
have multiple memory modules, you may be able to play remove & swap
games to identify which module is bad.  Check your hardware docs first -
on some systems, modules may need to be paired in some particular
fashion.
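
To give a feel for what a memory tester does at its core, here is a toy
write-then-verify pattern test.  It is a sketch only, and the function
name and pattern choice are mine.  It runs as an ordinary process, so it
only touches whatever pages the operating system happens to hand it, and
caches may hide faults; it is in no way a substitute for a stand-alone
tester.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern and read it back.

    Returns a list of (pattern, offset) pairs where the byte read back
    differed from the byte written.  An empty list means no mismatch
    was seen, which is weak evidence at best.
    """
    buf = bytearray(size_bytes)
    failures = []
    for p in patterns:
        expected = bytes([p]) * size_bytes
        buf[:] = expected          # write the pattern
        if buf != expected:        # read back and compare
            failures.extend(
                (p, i) for i in range(size_bytes) if buf[i] != p
            )
    return failures
```

The alternating patterns (0xAA, 0x55) are there so that neighbouring
bits hold opposite values, which is the usual trick for catching stuck
or shorted bits.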

It is certainly worth checking your machine for obvious physical
problems.  For instance, check air paths to ensure they aren't
blocked.  Be suspicious of burning smells, obvious heat, excessive fan
noise, or lack of distinct air flow.  Check the inside of the machine.
Is there excessive dust build-up?  Are the fan blades clean?  Do the
fans spin very smoothly and fairly freely?  Are the cables in the way?
Are there any loose cables?  Loose boards?  Bad solder joints or
cracks?  (On most modern motherboards, it's not worth spending much
time checking this if it's not easy to get to; removing the motherboard
may itself cause damage, and even a "large" crack sufficient to produce
complete failure may be nearly impossible to spot).  Other signs of
physical distress?  Ideally you want your machine to be in a
climate-controlled environment comfortable to people.  Dust, very dry
air, excessive moisture, temperature cycles, etc. are all bad.
Electrically conductive dust can become particularly exciting.

An older or fancier machine may have a separate disk controller, in
which case if you have a spare it may be worth swapping.  Your machine
is probably not one of these.

On many newer machines, the BIOS can contain settings which alter the
speed or timing of various bus components.  Getting this wrong can
produce subtle weirdness, or obvious and dramatic signs of failure.
It may take a while for subtle weirdness to manifest itself in any
obvious fashion.  If you have ECC memory, make sure the BIOS knows that.

Sorting all this out can take time.  If the machine is an older one, it
may be cheaper to replace it than figure out what failed.

Also, in case you missed it, building large software packages is
an excellent way to burn a new machine in or establish
that an existing machine is reliable.  :-)
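
The reason a big build works as a test is that it is a lot of
deterministic work: the same inputs should give the same outputs every
time.  The same idea in miniature, as a sketch (the function name and
the choice of SHA-256 are just my assumptions for illustration), is to
hash one block of data over and over and check that the answer never
changes.

```python
import hashlib


def repeated_digest_check(data, rounds=20):
    """Hash the same bytes `rounds` times; return the distinct digests.

    On a healthy machine the returned set has exactly one element.
    More than one means the same computation gave different answers,
    which points at hardware rather than software.
    """
    return {hashlib.sha256(data).hexdigest() for _ in range(rounds)}
```

Feed it a large buffer (tens of megabytes or more) and plenty of rounds
if you want it to exercise much memory.  Even then, a real build
stresses the disk, memory, and CPU far harder than this does.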

                                -Marcus
