John Almberg wrote:
> Thanks for all the tips. At least I have something to start with.
>
> The guys in the data center reinstalled FreeBSD (the filesystem was
> totally corrupted again), and then ran what they called "SMART test",
> which might be smartctl, and said the hard drives look good.
>
> I am now able to get back in.
>
> So the system ran fine until I put a load on it with the database
> (many transactions a second). This corrupted the file system again.
>
> So I guess I need to load it enough to produce error messages
> (hopefully) but not enough to destroy the file system again.
>
> Sounds like fun :-(
>
> This is an Intel server, not a crummy white box, so hopefully it is
> smart enough to monitor its own hardware at least a bit. We'll see.
>
Just a tidbit or two. If it has an ICHR type South Bridge with what Intel calls "Matrix RAID", there have been reports of problems when trying to use the RAID functionality. If you are not using the RAID, make sure the data center guys turn it off in the BIOS. Whenever I see these kinds of reports, where data corruption goes along with SMART saying the drives are "good", I think "disk controller".

It does seem strange if the problem was not present before the "power fluctuations". But where hardware damage occurs can be funky. At least with the box I once had that took a direct lightning strike, it was interesting to see where the lightning bounced around inside.

If this is a 1U pizza box with only one power supply, I would suspect that the power supply was damaged by the power problem. If it is a relatively low wattage unit, the damage may have left it without enough overhead to provide clean, regulated DC under full load. I remember that a software company where I worked for a few years stuck the old WORM drives in an HP Vectra desktop that only had a 135 watt power supply. You could see the power go all wonky on an oscilloscope as soon as that WORM drive started up, but the box had worked well up until that point.

At any rate, this all sounds like hardware to me. If it wasn't doing any of this before the so-called "power event", then I believe there has been hardware damage. Unless you are co-locating your own hardware, it is the responsibility of the data center to provide you with functional hardware. After the first go-around, with the same problem resurfacing, they should have yanked the box and just replaced it: put a good one in service and troubleshoot the bad one offline. If they can't hold up their end of the deal, you need to be looking somewhere else.
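If you do end up having to load the box gently yourself before they swap it, something along these lines might help catch corruption before the filesystem is wrecked. This is only a rough sketch in Python, not anything specific to your setup: the function name write_and_verify, the file names, and the sizes are all made up for illustration. It writes random data to a scratch directory, forces it out with fsync, reads it back, and compares SHA-256 checksums. Keep in mind the read-back may be served from the OS buffer cache rather than the disk, so a clean run proves little, while a mismatch is a strong hint of bad hardware.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, n_files=8, size_bytes=1 << 20):
    """Write random files, fsync them, read them back, compare SHA-256.

    Returns the list of paths whose read-back digest did not match.
    All names and default sizes here are illustrative assumptions.
    """
    expected = {}
    for i in range(n_files):
        path = os.path.join(directory, f"verify_{i:04d}.bin")
        data = os.urandom(size_bytes)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        expected[path] = hashlib.sha256(data).hexdigest()

    mismatches = []
    for path, digest in expected.items():
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != digest:
            mismatches.append(path)
        os.remove(path)  # clean up the scratch file either way
    return mismatches


if __name__ == "__main__":
    # Start small; raise n_files and size_bytes gradually between runs.
    with tempfile.TemporaryDirectory() as scratch:
        bad = write_and_verify(scratch, n_files=4, size_bytes=64 * 1024)
        print("mismatched files:", bad)
```

Point the directory argument at a scratch area on the suspect disk rather than the default temporary directory, and step the load up slowly while watching the console and system logs for controller or disk errors.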
-Mike