> I've not see this type of problem before, so I
> turn to you guys. Is this a sign that maybe
> a drive is going bad? Or sign of bad memory?
>
> What's going on here!? I know it is almost
> Halloween and all, but this is kinda _spooky_
> to say the least.
>
>
> Idea? Please? :-)

Hard drives contain lots of moving parts, a known reliability risk. Therefore most if not all modern hard disks and their associated logic contain more or less elaborate internal self-checking to detect failing media, a failing spindle motor, a failing head positioning mechanism, over- and under-voltage, bus driver failure, etc. Most of these will result in kernel messages and/or other obvious signs of system distress. Your "dmesg" (assuming it was done after the failed build) doesn't show any evidence of such a problem, so there's no reason to suspect a hard disk going bad.

More likely possibilities are bad memory, a bad motherboard, incompatible memory, a bad disk controller, mis-configured bus speeds, an environmental problem, or possibly, but less likely, a bad CPU.

Memory is simple: if you buy a "consumer grade" home machine, you get memory that has no self-check logic. A chip going bad could well produce the problems you show below. A "server class" machine will nearly always contain ECC memory. A few companies (Dell, Sun) also make "commercial grade" desktop machines, which usually also contain ECC. Note that most "home computer" stores and even many professionals don't understand or value ECC memory, and will steer you away from such technology.

If it's memory, it may still be easy to see that it's broken even without self-check logic. "memtest86+" has a good reputation. This is a stand-alone program, which you can leave running overnight. If the machine fails memtest86+, then the problem is obvious. If it passes, the memory is still not in the clear; for instance, it's in theory possible for the memory to fail when accessed by DMA but not by the processor. If you can get the memory to fail more or less predictably, and you have multiple memory modules, you may be able to play remove & swap games to identify which module is bad. Check your hardware documentation first: on some systems, modules may need to be paired in some particular fashion.
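
If you want a crude way to provoke a repeatable failure from the running system before you start swapping modules, something like the following might help. It is only a sketch, assuming Python 3 is installed; the function names and the scratch file are made up for illustration. It hashes the same file over and over and reports any pass whose digest differs from the first. After the first pass the data is usually served from the OS cache in RAM, so this tends to exercise memory and CPU more than the disk. A clean run proves nothing, but a mismatch on data that hasn't changed is a strong hint that some piece of hardware is lying to you.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files don't need to fit in one buffer.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_read_check(path, passes=20):
    """Hash `path` `passes` times in total.

    Returns a list of (pass_number, digest) for every pass whose digest
    differs from the first one. An empty list means no mismatch was seen.
    """
    reference = file_digest(path)
    mismatches = []
    for i in range(1, passes):
        digest = file_digest(path)
        if digest != reference:
            mismatches.append((i, digest))
    return mismatches


if __name__ == "__main__":
    # Hypothetical scratch file filled with random data; in practice you
    # would point this at a large file that nothing else is writing to.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(16 * 1024 * 1024))
        scratch = tmp.name
    try:
        bad = repeated_read_check(scratch, passes=5)
        print("mismatching passes:", bad if bad else "none")
    finally:
        os.remove(scratch)
```

The bigger the file and the more passes, the better the odds of tripping over a marginal chip. If it does report mismatches, rerun it after each module swap to see whether the failure follows a particular module.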

It is certainly worth checking your machine for obvious physical problems. For instance, check the air paths to ensure they aren't blocked. Be suspicious of burning smells, obvious heat, excessive fan noise, or lack of distinct air flow. Check the inside of the machine. Is there excessive dust build-up? Are the fan blades clean? Do the fans spin very smoothly and fairly freely? Are the cables in the way? Are there any loose cables? Loose boards? Bad solder joints or cracks? (On most modern motherboards, it's not worth spending much time checking this if it's not easy to get to; removing the motherboard may itself cause damage, and even a "large" crack sufficient to produce complete failure may be nearly impossible to spot.) Other signs of physical distress?

Ideally you want your machine to be in a climate-controlled environment comfortable to people. Dust, very dry air, excessive moisture, temperature cycles, etc. are all bad. Electrically conductive dust can become particularly exciting.

An older or fancier machine may have a separate disk controller, in which case if you have a spare it may be worth swapping. Your machine is probably not one of these.

On many newer machines, the BIOS can contain settings which alter the speed or timing of various bus components. Getting this wrong can produce subtle weirdness, or obvious and dramatic signs of failure. It may take a while for subtle weirdness to manifest itself in any obvious fashion. If you have ECC memory, make sure the BIOS knows that.

Sorting all this out can take time. If the machine is an older one, it may be cheaper to replace it than to figure out what failed.

Also, in case you missed it, building large software packages is an excellent way to burn a new machine in or establish that an existing machine is reliable. :-)

-Marcus