Marco Peereboom wrote:
On Mon, Nov 17, 2008 at 01:44:27PM -0700, Jeff Ross wrote:
Hi all,
At work I've got a server with an LSI MegaRAID (dmesg below) that
suddenly seems to be killing hard drives. Last Thursday I had one drive
fail, and the system didn't begin rebuilding onto the hot spare until I
rebooted.
How did you create the hotspare?
If you created it using an old version of ami there is a good chance
that the hotspare creation didn't work right even though it shows up as
a hotspare (weird firmware requirements when creating the hotspare; read
the cvs logs for an explanation if you care).
I created it with bioctl, but my version is from a September 1 snapshot
so it is before your fix.
Today I lost another drive in the same safte0. I pulled another
replacement drive off the shelf, swapped out the dead one, and did a
bioctl -H 0:9 sd0 to mark it as a hot spare, but no rebuild has started yet.
Note that 1:0 in safte1 was already marked as a hot spare, but this is a
separate safte enclosure and I've never been sure if the hot spare would
work across enclosures. I've always had a hot spare in each safte
enclosure until this happened.
As long as the hotspares are on the same controller it does not matter
what channel they are on. Again, see my previous blurb about creating
hotspares.
Okay, that's good to know. I still have one hotspare that I will try to
get the rebuild initiated on.
Replacing the failed disk in the physical location with an appropriately
sized disk will kick off the rebuild. In fact, if you don't believe your
disks have actually failed, remove the failed one (make sure that you run
some io to the logical disk while doing this), wait a few minutes, and
reinsert it. The ami card will then try to rebuild the raid set onto that
disk. This is obviously not recommended unless you know what you are
doing!
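The pull-and-reinsert procedure above can be sketched from the command
line; this is only an illustration, with the device names sd0 and ami0
assumed from the bioctl output elsewhere in this thread:

```shell
# Keep some read I/O running against the degraded logical disk so the
# controller notices the change (sd0 is an assumed device name):
sudo dd if=/dev/rsd0c of=/dev/null bs=64k count=100000 &

# ...pull the suspect drive, wait a few minutes, reinsert it...

# Then watch the controller's view of the volumes for a rebuild:
sudo bioctl -i ami0
```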
Unfortunately, I don't count myself in the camp of knowing what I'm
doing :-) but I'm learning more as I go. Probably I will have to try
this tonight after everyone else leaves.
I'll say this even though someone might yell at me...
Make sure you have appropriate cables and that the connectors are
plugged in right. ami controllers are very sensitive to noise on the
cables (all U320 gear really is). Don't use a shitty cable because that
might lead to phantom failed drives (if a command doesn't complete
within the required timeout the disk will be marked failed).
The server has run flawlessly for 2 years now, and I'll bet it's been a
year since I've even slid it forward enough in the rack to get the cover
off. Do cables go bad with use?
Also if you have a cheap enclosure that only supports up to a certain
speed you want to make sure that you throttle the channel to the
appropriate speed. I have seen cheap enclosures pretend to run at U320
even though they really only could support U160. The results were,...
odd. You can change this in the CTRL-M BIOS during POST.
These are SuperMicro GEM enclosures that are rated for U320 and they
weren't cheap in my book but then that's a relative thing.
Here's the latest bioctl -i ami0:
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -v -i ami0
Volume  Status     Size         Device
 ami0 0 Degraded   72999763968  sd0     RAID1
      0 Failed     73403465728  0:13.0  safte0 <HITACHI HUS151473VL3800 S3C0> ' J5VHVNPB'
      1 Online     73403465728  0:10.0  safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W09L5A0050B499004B'
 ami0 1 Online     72999763968  sd1     RAID1
      0 Online     73403465728  0:11.0  safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W06MNA0050B4AD01D3'
      1 Online     73403465728  0:12.0  safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W0A6VA0050B4A80C0C'
 ami0 2 Online     72999763968  sd2     RAID1
      0 Online     73403465728  1:4.0   safte1 <HITACHI HUS103073FL3800 SA1B> 'V3VZV2JA0050B4AX04C2'
      1 Online     73403465728  1:1.0   safte1 <HITACHI HUS103073FL3800 SA1B> 'V3W0726A0050B49W01CB'
 ami0 3 Hot spare  73403465728  0:9.0   safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W093EA0050B44V0578'
 ami0 4 Hot spare  73403465728  1:0.0   safte1 <HITACHI HUS103073FL3800 SA1B> 'V3W07PSA0050B4710207'
Also interesting is that safte0 will not blink any of the drives, while
safte1 will.
That is a safte problem. ami sends a generic blink command to the safte
enclosure and it is up to the enclosure to honor it.
I may try to replace the safte0 enclosure this weekend then.
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -b 0:9 ami0
bioctl: BIOCBLINK: Operation not supported by device
Questions, then: these drives are all Hitachi Ultrastar 10K300s from
2005. Has anyone had bad experiences with them? They are all
still under warranty, and I don't suppose it's out of the question that
2 drives out of 8 would fail within 72 hours of each other, especially
if the lot was bad.
A bad cable can do this to you. I have seen drives fail in all kinds of
different ways, so this isn't as uncommon as you might think either. Be
cautious with those drives once you have assured yourself that the rest
of the hardware is in good shape.
So far as I know, the SAFTE enclosures are identical. Why will one
support blinking the drives and the other not?
Ask your enclosure vendor.
Should the ami be rebuilding sd0 now that I've set a hot spare, without
any other action on my part, or do I need to kick off the rebuild with
bioctl -R 0:9 sd0?
Yes it should, but due to a bug that I fixed recently the drives might
have been marked as hotspares even though they would not kick in.
To fix this go into the CTRL-M BIOS and delete the hotspares and
recreate them there or with the latest and greatest ami driver.
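A hedged sketch of that recovery, after deleting the stale hotspares in
the Ctrl-M BIOS; the 0:9 and 1:0 channel:target locations are taken from
the bioctl output in this thread, and the device argument follows the
usage shown earlier:

```shell
# Recreate each hotspare with an ami driver/bioctl that has the fix:
sudo bioctl -H 0:9 sd0
sudo bioctl -H 1:0 sd0

# Confirm both show up as "Hot spare" again:
sudo bioctl -i ami0
```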
Will do this tonight and then bring the system back up to -current with
the latest snapshot.
So far I haven't stumbled on the magic combination to make bioctl -q work:
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -q 1:4
bioctl: Can't locate 1:4 device via /dev/bio
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -q ami0
bioctl: DIOCINQ: No such file or directory
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -q sd0
bioctl: DIOCINQ: Invalid argument
-q is for sd devices, not physical ids.
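For reference, the invocation form Marco means, assuming the RAID1
volume attaches as sd0:

```shell
# Query the logical sd device, not a channel:target pair; this needs a
# driver new enough to implement the inquiry ioctl for ami(4):
sudo bioctl -q sd0
```

The "Invalid argument" error from the September snapshot's driver
suggests it simply doesn't support that query yet.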
So the last combination I tried is the correct one? It still fails here:
[EMAIL PROTECTED]:/usr/src $ sudo bioctl -q sd0
bioctl: DIOCINQ: Invalid argument
Hitachi's drive testing tool seems to be Windows-only, so are there any
drive checking utilities that can check an individual drive while it's
part of a RAID1? Or is it safe to assume that if the drive fails in the
RAID it is really dead? I'm trying to make sure I'm not seeing some
kind of problem with the enclosure or the megaraid card before I start
shipping drives back to Hitachi.
Meh drive testing tools. Use at your own peril.
Okay, I'll get back to work getting this other server online so I can
individually mount those drives in an enclosure that is not connected to
a RAID.
Thanks for your input and expertise, Marco. If you get a moment, I
posted a follow-up as a reply to Dieter's response. There are more
anomalies afoot, I fear, especially with drives just not showing up,
even in the Ctrl-M BIOS.
Jeff