Marco Peereboom wrote:
On Mon, Nov 17, 2008 at 01:44:27PM -0700, Jeff Ross wrote:
Hi all,
At work I've got a server with an LSI MegaRAID (dmesg below) that
suddenly seems to be killing hard drives. Last Thursday I had one drive
fail, and the system didn't begin rebuilding onto the hot spare until I
rebooted.
How did you create the hotspare?
If you created it using an old version of ami there is a good chance
that the hotspare creation didn't work right even though it shows up as
a hotspare (weird firmware requirements when creating the hotspare; read
the cvs logs for an explanation if you care).
I created it with bioctl, but my version is from a September 1 snapshot
so it is before your fix.
Today I lost another drive in the same safte0. I pulled another
replacement drive off the shelf, swapped out the dead one, and did a
bioctl -H 0:9 sd0 to mark it as a hot spare, but no rebuild has started yet.
Note that 1:0 in safte1 was already marked as a hot spare, but this is a
separate safte enclosure and I've never been sure if the hot spare would
work across enclosures. I've always had a hot spare in each safte
enclosure until this happened.
As long as the hotspares are on the same controller it does not matter
what channel they are on. Again, see my previous blurb about creating
hotspares.
Okay, that's good to know. I still have one hotspare that I will try to
get the rebuild initiated on.
Replacing the failed disk in the physical location with an appropriately
sized disk will kick off the rebuild. In fact, if you don't believe your
disks have actually failed, remove the failed one (make sure that you run
some io to the logical disk while doing this), wait a few minutes, and
reinsert it. The ami card will then try to rebuild the raid set onto that
disk. This is obviously not recommended unless you know what you are
doing!
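The pull-and-reinsert procedure above can be sketched from the command
line; this is only an illustration, with the device names sd0 and ami0
assumed from the bioctl output elsewhere in this thread:

```shell
# Keep some read I/O running against the degraded logical disk so the
# controller notices the change (sd0 is an assumed device name):
sudo dd if=/dev/rsd0c of=/dev/null bs=64k count=100000 &

# ...pull the suspect drive, wait a few minutes, reinsert it...

# Then watch the controller's view of the volumes for a rebuild:
sudo bioctl -i ami0
```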
Unfortunately, I don't count myself in the camp of knowing what I'm
doing :-) but I'm learning more as I go. Probably I will have to try
this tonight after everyone else leaves.
I'll say this even though someone might yell at me...
Make sure you have appropriate cables and that the connectors are
plugged in right. ami controllers are very sensitive to noise on the
cables (all U320 gear really is). Don't use a shitty cable because that
might lead to phantom failed drives (if a command doesn't complete
within the required timeout the disk will be marked failed).
The server has run flawlessly for 2 years now, and I'll bet it's been a
year since I've even slid it forward enough in the rack to get the cover
off. Do cables go bad with use?
Also if you have a cheap enclosure that only supports up to a certain
speed you want to make sure that you throttle the channel to the
appropriate speed. I have seen cheap enclosures pretend to run at U320
even though they really only could support U160. The results were,...
odd. You can change this in the CTRL-M BIOS during POST.
These are SuperMicro GEM enclosures that are rated for U320 and they
weren't cheap in my book but then that's a relative thing.
Here's the latest bioctl -i ami0:
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -v -i ami0
Volume  Status     Size         Device
 ami0 0 Degraded   72999763968  sd0     RAID1
      0 Failed     73403465728  0:13.0  safte0 <HITACHI HUS151473VL3800 S3C0> ' J5VHVNPB'
      1 Online     73403465728  0:10.0  safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W09L5A0050B499004B'
 ami0 1 Online     72999763968  sd1     RAID1
      0 Online     73403465728  0:11.0  safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W06MNA0050B4AD01D3'
      1 Online     73403465728  0:12.0  safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W0A6VA0050B4A80C0C'
 ami0 2 Online     72999763968  sd2     RAID1
      0 Online     73403465728  1:4.0   safte1 <HITACHI HUS103073FL3800 SA1B> 'V3VZV2JA0050B4AX04C2'
      1 Online     73403465728  1:1.0   safte1 <HITACHI HUS103073FL3800 SA1B> 'V3W0726A0050B49W01CB'
 ami0 3 Hot spare  73403465728  0:9.0   safte0 <HITACHI HUS103073FL3800 SA1B> 'V3W093EA0050B44V0578'
 ami0 4 Hot spare  73403465728  1:0.0   safte1 <HITACHI HUS103073FL3800 SA1B> 'V3W07PSA0050B4710207'
Also interesting is that safte0 will not blink any of the drives, while
safte1 will.
That is a safte problem. ami sends a generic blink command to the safte
enclosure and it is up to the enclosure to honor it.
I may try to replace the safte0 enclosure this weekend then.
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -b 0:9 ami0
bioctl: BIOCBLINK: Operation not supported by device
Questions, then: these drives are all Hitachi Ultrastar 10K300s from
2005. Has anyone had bad experiences with them? They are all
still under warranty, and I don't suppose it's out of the question that
2 drives out of 8 would fail within 72 hours of each other, especially
if the lot was bad.
A bad cable can do this to you. I have seen drives fail in all kinds of
different ways, so this isn't as uncommon as you might think either. Be
cautious with those drives once you have assured yourself that the rest
of the hardware is in good shape.
So far as I know, the SAFTE enclosures are identical. Why will one
support blinking the drives and the other not?
Ask your enclosure vendor.
Should the ami be rebuilding sd0 now that I've set a hot spare, without
any other action on my part, or do I need to kick off the rebuild with
bioctl -R 0:9 sd0?
Yes it should, but due to a bug that I fixed recently the drives might
have been marked as hotspares even though they would not kick in.
To fix this go into the CTRL-M BIOS and delete the hotspares and
recreate them there or with the latest and greatest ami driver.
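A hedged sketch of that recovery, after deleting the stale hotspares in
the Ctrl-M BIOS; the 0:9 and 1:0 channel:target locations are taken from
the bioctl output in this thread, and the device argument follows the
usage shown earlier:

```shell
# Recreate each hotspare with an ami driver/bioctl that has the fix:
sudo bioctl -H 0:9 sd0
sudo bioctl -H 1:0 sd0

# Confirm both show up as "Hot spare" again:
sudo bioctl -i ami0
```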
Will do this tonight and then bring the system back up to -current with
the latest snapshot.
So far I haven't stumbled on the magic combination to make bioctl -q work:
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -q 1:4
bioctl: Can't locate 1:4 device via /dev/bio
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -q ami0
bioctl: DIOCINQ: No such file or directory
[EMAIL PROTECTED]:/home/jross $ sudo bioctl -q sd0
bioctl: DIOCINQ: Invalid argument
-q is for sd devices, not physical ids.
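For reference, the invocation form Marco means, assuming the RAID1
volume attaches as sd0:

```shell
# Query the logical sd device, not a channel:target pair; this needs a
# driver new enough to implement the inquiry ioctl for ami(4):
sudo bioctl -q sd0
```

The "Invalid argument" error from the September snapshot's driver
suggests it simply doesn't support that query yet.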
So the last combination I tried is the correct one? It still fails here:
[EMAIL PROTECTED]:/usr/src $ sudo bioctl -q sd0
bioctl: DIOCINQ: Invalid argument
Hitachi's drive testing tool seems to be Windows-only, so are there any
drive checking utilities that can check an individual drive while it's
part of a RAID1? Or is it safe to assume that if the drive fails in the
RAID it is really dead? I'm trying to make sure I'm not seeing some
kind of problem with the enclosure or the megaraid card before I start
shipping drives back to Hitachi.
Meh drive testing tools. Use at your own peril.
Okay, I'll get back to work getting this other server online so I can
individually mount those drives in an enclosure that is not connected to
a RAID.
Thanks for your input and expertise, Marco. If you get a moment, I
posted a follow-up as a reply to Dieter's response. There are more
anomalies afoot, I fear, especially with drives just not showing up,
even in the Ctrl-M BIOS.
Jeff