Dieter wrote:
At work I've got a server with an LSI MegaRAID (dmesg below) that suddenly seems to be killing hard drives. Last Thursday I had one drive fail, and the system didn't begin rebuilding onto the hot spare until I rebooted.

I would hope that the controller isn't killing drives.


Me, too.  Or the enclosure.

Can we presume the system has clean power, temps are ok, no vibration, etc. ?

Yes, all power is through an MGE Pulsar Evolution. The server is rack mounted, and sysctl reports all temps as normal.

[EMAIL PROTECTED]:/home/jross $ sysctl -a | grep hw

hw.sensors.ami0.drive0=degraded (sd0), WARNING
hw.sensors.ami0.drive1=online (sd1), OK
hw.sensors.ami0.drive2=online (sd2), OK
hw.sensors.safte0.temp0=23.00 degC, OK
hw.sensors.safte1.temp0=24.00 degC, OK
hw.sensors.lm1.temp0=40.00 degC
hw.sensors.lm1.temp1=29.00 degC
hw.sensors.lm1.temp2=29.50 degC
hw.sensors.lm1.fan0=6026 RPM
hw.sensors.lm1.fan1=6026 RPM


Hitachi's drive-testing tool seems to be Windows-only, so are there any drive-checking utilities that can test an individual drive while it's part of a RAID 1? Or is it safe to assume that if the drive fails in the RAID, it really is dead? I'm trying to make sure I'm not seeing some kind of problem with the enclosure or the MegaRAID card before I start shipping drives back to Hitachi.

Can you get the SMART data from the drives?  Interpreting SMART data
is another problem, but maybe you can find a clue there.
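If smartmontools is installed, `smartctl -a` on the raw device should dump the attribute table (whether it works through the MegaRAID depends on the card; on a plain controller it should be fine). As a starting point for interpreting it, a few attributes correlate strongly with a genuinely dying drive. Here is a sketch that greps those out of smartctl-style output; the sample lines and raw values below are made up for illustration, not from Jeff's drives:

```shell
# Filter for the SMART attributes most often tied to physically failing
# media. A nonzero raw value (last field) in any of these is a strong
# hint the drive itself, not the controller or enclosure, is at fault.
check_smart() {
    awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable/ \
         && $NF > 0 { print $2, "raw =", $NF }'
}

# Hypothetical sample of smartctl -a attribute lines, for demonstration.
# In practice you would pipe in: smartctl -a /dev/suspect_disk
smart_sample='  5 Reallocated_Sector_Ct   0x0033   095   095   005    Pre-fail  Always       -       132
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       3'

echo "$smart_sample" | check_smart
```

A healthy drive shows raw value 0 on all three; growing Reallocated or Pending counts usually mean the drive, not the RAID card, is the problem.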

Is it possible that the drives just took "too long" to read or write and
the RAID marked them bad?  Maybe remapping a bad sector takes too long...

Maybe hook them to a different controller (no RAID) and do a simple test
with dd over the entire drive, something like

dd if=/dev/suspect_disk of=/dev/null bs=1m
dd if=/dev/zero of=/dev/suspect_disk bs=1m

(note the second command destroys everything on the drive)

and see if you get any errors from dd or in dmesg.
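The read half of that test can be wrapped so the pass/fail result is obvious in a log. This is just a convenience sketch around the dd command above; the function name is made up, and `bs=64k` is used instead of `bs=1m` only because the lowercase `m` suffix is BSD-specific:

```shell
# check_disk: run a full sequential read of the given device with dd
# and report whether dd completed without an I/O error. Any error text
# from dd goes to stderr. (Read-only -- safe, unlike the zero-fill test.)
check_disk() {
    disk="$1"
    err="/tmp/dd_err.$$"
    if dd if="$disk" of=/dev/null bs=64k 2>"$err"; then
        echo "read test passed: $disk"
    else
        echo "read test FAILED: $disk"
        cat "$err" >&2
    fi
    rm -f "$err"
}

# Usage (device name is a placeholder):
#   check_disk /dev/suspect_disk
```

Remember to also watch dmesg while it runs; the driver sometimes logs sense errors that dd itself never sees.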

Last night, after all the users left, I rebooted the server to get into the MegaRAID controller at boot. It couldn't see the brand-new drive I had just put into the safte0 enclosure, so I couldn't make it a hot spare.

I installed the two drives that have now failed into another server with an identical setup (one minor variation--it has two separate LSI MegaRAID cards instead of one card with two channels) and a completely empty SAF-TE enclosure, and again the card could not see the drives at all. I'm thinking that means they really are dead.

I have another chassis and a new SuperMicro motherboard with an onboard SCSI controller that I'll build up today. Then I should be able to get at the individual drives without going through the LSI RAID card and run the tests you suggested.

The fact that the LSI card couldn't see that new drive (identical in size, but 15K RPM instead of 10K) is disconcerting, to say the least. The only comforting thought is that in this case sd0 contains the /, swap, /usr, and similar partitions--all operating system, no database or web server partitions. I think I'll double up on the tape backups, just to be sure.

Thanks for the suggestions.

Jeff
