On Mon, Nov 17, 2008 at 01:44:27PM -0700, Jeff Ross wrote:
> Hi all,
>
> At work I've got a server with an LSI MegaRAID (dmesg below) that  
> suddenly seems to be killing hard drives.  Last Thursday I had one drive  
> fail, and the system didn't begin rebuilding onto the hot spare until I  
> rebooted.

How did you create the hotspare?

If you created it with an old version of the ami driver, there is a
good chance the hotspare creation didn't work right even though the
disk shows up as a hotspare (the firmware has weird requirements when
creating a hotspare; read the cvs logs for an explanation if you care).

>
> Today I lost another drive in the same safte0.  I pulled another  
> replacement drive off the shelf, swapped out the dead one, did a bioctl  
> -H 0:9 sd0 to mark it as a hot spare but no rebuild has started yet.  
> Note that 1:0 in safte1 was already marked as a hot spare, but this is a  
> separate safte enclosure and I've never been sure if the hot spare would  
> work across enclosures.  I've always had a hot spare in each safte  
> enclosure until this happened.

As long as the hotspares are on the same controller, it does not matter
which channel they are on.  Again, see my previous blurb about creating
hotspares.

Replacing the failed disk in the same physical location with an
appropriately sized disk will kick off the rebuild.  In fact, if you
don't believe your disks have actually failed, remove the failed one
(make sure you run some io to the logical disk while doing this), wait
a few minutes and reinsert it.  The ami card will then try to rebuild
the raid set onto that disk.  This is obviously not recommended unless
you know what you are doing!
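
To see whether a rebuild has actually started, it is easier to watch
the volume status than to guess.  A minimal sketch, assuming the bioctl
listing layout quoted further down in this mail (controller, volume
number, status, size, sd device, RAID level).  The name volume_status
is made up for illustration and is not part of bioctl:

```sh
#!/bin/sh
# volume_status: read bioctl output on stdin and print "sdN status"
# for every logical volume line, i.e. lines that start with an amiN
# controller name and carry an sd device in the fifth column.
# Assumption: the column layout matches the listing in this thread.
volume_status() {
    awk '$1 ~ /^ami[0-9]+$/ && $5 ~ /^sd[0-9]+$/ { print $5, $3 }'
}

# Example, using the invocation from this thread (needs root):
#   bioctl -i ami0 | volume_status
```

Physical disk and hot spare lines are skipped on purpose, so the output
is one line per logical volume and is easy to poll from a loop or cron.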

I'll say this even though someone might yell at me...

Make sure you have appropriate cables and that the connectors are
plugged in right.  ami controllers are very sensitive to noise on the
cables (all U320 gear really is).  Don't use a shitty cable because that
might lead to phantom failed drives (if a command doesn't complete
within the required timeout, the disk is marked failed).

Also, if you have a cheap enclosure that only supports up to a certain
speed, make sure you throttle the channel to that speed.  I have seen
cheap enclosures pretend to run at U320 even though they could really
only support U160.  The results were... odd.  You can change the
channel speed in the CTRL-M BIOS during POST.

>
> Here's the latest bioctl -i ami0
>
>  [EMAIL PROTECTED]:/home/jross $ sudo bioctl -v -i ami0
> Volume  Status               Size Device
>  ami0 0 Degraded      72999763968 sd0     RAID1
>       0 Failed        73403465728 0:13.0  safte0 <HITACHI HUS151473VL3800 S3C0>
>                                                  '        J5VHVNPB'
>       1 Online        73403465728 0:10.0  safte0 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3W09L5A0050B499004B'
>  ami0 1 Online        72999763968 sd1     RAID1
>       0 Online        73403465728 0:11.0  safte0 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3W06MNA0050B4AD01D3'
>       1 Online        73403465728 0:12.0  safte0 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3W0A6VA0050B4A80C0C'
>  ami0 2 Online        72999763968 sd2     RAID1
>       0 Online        73403465728 1:4.0   safte1 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3VZV2JA0050B4AX04C2'
>       1 Online        73403465728 1:1.0   safte1 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3W0726A0050B49W01CB'
>  ami0 3 Hot spare     73403465728 0:9.0   safte0 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3W093EA0050B44V0578'
>  ami0 4 Hot spare     73403465728 1:0.0   safte1 <HITACHI HUS103073FL3800 SA1B>
>                                                  'V3W07PSA0050B4710207'
>
>
> Also interesting is that safte0 will not blink any of the drives, while  
> safte 1 will.

That is a safte problem.  ami sends a generic blink command to the
safte enclosure, and it is up to the enclosure to honor it.

>
> [EMAIL PROTECTED]:/home/jross $ sudo bioctl -b 0:9 ami0
> bioctl: BIOCBLINK: Operation not supported by device
>
>
> Questions, then:  these drives are all Hitachi Ultrastars 10K300 from  
> 2005.  Has any one had any bad experiences with them?  They are all  
> still under warranty, and I don't suppose it's out of the question that  
> 2 drives out of 8 would fail within 72 hours of each other, especially  
> if the lot was bad.

A bad cable can do this to you.  I have seen drives fail in all kinds of
different ways, so this isn't as uncommon as you think either.  Be
cautious with those drives once you have satisfied yourself that the
rest of the hardware is in good shape.

>
> So far as I know, the SAFTE enclosures are identical.  Why will one  
> support blinking the drives and the other not?

Ask your enclosure vendor.

>
> Should the ami be rebuilding the sd0 now that I've set a hot spare  
> without any other action on my part, or do I need to kick off the  
> rebuild with bioctl -R 0:9 sd0.

Yes, it should, but due to a bug that I fixed recently the drives might
have been marked hotspare even though they would not kick in.

To fix this, go into the CTRL-M BIOS, delete the hotspares and recreate
them, either there or with the latest and greatest ami driver.
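
After recreating the hotspares it is worth checking what the controller
reports.  A rough sketch, assuming the bioctl listing layout you quoted
above; hot_spares is a hypothetical helper name, not a bioctl feature.
Keep in mind that a disk listed as "Hot spare" only proves it is
listed, not that the firmware will actually use it (that is exactly the
bug described above):

```sh
#!/bin/sh
# hot_spares: read bioctl output on stdin and print
# "channel:target.lun enclosure" for every line marked "Hot spare".
# Assumption: the column layout matches the listing in this thread,
# where "Hot spare" splits into two whitespace-separated fields.
hot_spares() {
    awk '$1 ~ /^ami[0-9]+$/ && $3 == "Hot" && $4 == "spare" { print $6, $7 }'
}

# Example, using the invocation from this thread (needs root):
#   bioctl -i ami0 | hot_spares
```

With the listing from your mail this would report one spare per
enclosure, which is what you want to see again after recreating them.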

>
> So far I haven't stumbled on the magic combination to make bioctl -q work:
> [EMAIL PROTECTED]:/home/jross $sudo bioctl -q 1:4
> bioctl: Can't locate 1:4 device via /dev/bio
> [EMAIL PROTECTED]:/home/jross $ sudo bioctl -q ami0
> bioctl: DIOCINQ: No such file or directory
> [EMAIL PROTECTED]:/home/jross $ sudo bioctl -q sd0
> bioctl: DIOCINQ: Invalid argument

-q is for sd devices, not physical ids.

>
> Hitachi's drive testing tool seems to be windows only, so are there any  
> drive checking utilities that can check an individual drive when it's a  
> part of a RAID1?  Or is it safe to assume that if the drive fails in the  
> RAID it is really dead.  I'm trying to make sure I'm not seeing some  
> kind of problem with the enclosure or the megaraid card before I start  
> shipping drives back to Hitachi.

Meh drive testing tools.  Use at your own peril.

>
> Thanks!
>
> Jeff
>
> OpenBSD 4.4-current (GENERIC.MP) #860: Mon Sep  1 13:55:06 MDT 2008
>     [EMAIL PROTECTED]:/usr/src/sys/arch/i386/compile/GENERIC.MP
> cpu0: Intel(R) Xeon(TM) CPU 2.66GHz ("GenuineIntel" 686-class) 2.67 GHz
> cpu0:  
> FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,CNXT-ID,xTPR
> real mem  = 2146988032 (2047MB)
> avail mem = 2067562496 (1971MB)
> mainbus0 at root
> bios0 at mainbus0: AT/286+ BIOS, date 02/09/05, BIOS32 rev. 0 @ 0xf0010,  
> SMBIOS rev. 2.3 @ 0xf82a0 (48 entries)
> bios0: vendor American Megatrends Inc. version "080008" date 02/09/2005
> acpi0 at bios0: rev 0
> acpi0: tables DSDT FACP APIC OEMB
> acpi0: wakeup devices PS2K(S1) PS2M(S1) SMBS(S1) AUDI(S1) MODM(S1)  
> USB0(S1) USB1(S1) USB2(S1) P0P1(S1)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: apic clock running at 133MHz
> cpu1 at mainbus0: apid 6 (application processor)
> cpu1: Intel(R) Xeon(TM) CPU 2.66GHz ("GenuineIntel" 686-class) 2.67 GHz
> cpu1:  
> FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,CNXT-ID,xTPR
> cpu2 at mainbus0: apid 1 (application processor)
> cpu2: Intel(R) Xeon(TM) CPU 2.66GHz ("GenuineIntel" 686-class) 2.67 GHz
> cpu2:  
> FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,CNXT-ID,xTPR
> cpu3 at mainbus0: apid 7 (application processor)
> cpu3: Intel(R) Xeon(TM) CPU 2.66GHz ("GenuineIntel" 686-class) 2.67 GHz
> cpu3:  
> FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,CNXT-ID,xTPR
> ioapic0 at mainbus0: apid 8 pa 0xfec00000, version 20, 24 pins
> ioapic1 at mainbus0: apid 9 pa 0xfec80000, version 20, 24 pins
> ioapic2 at mainbus0: apid 10 pa 0xfec80400, version 20, 24 pins
> acpiprt0 at acpi0: bus 0 (PCI0)
> acpiprt1 at acpi0: bus 1 (P0P1)
> acpiprt2 at acpi0: bus 3 (P2P3)
> acpiprt3 at acpi0: bus 5 (P2P4)
> acpicpu0 at acpi0
> acpicpu1 at acpi0
> acpicpu2 at acpi0
> acpicpu3 at acpi0
> acpibtn0 at acpi0: SPBT
> bios0: ROM list: 0xc0000/0x8000 0xc8000/0x1800 0xc9800/0x2200
> pci0 at mainbus0 bus 0: configuration mode 1 (no bios)
> pchb0 at pci0 dev 0 function 0 "Intel E7501 Host" rev 0x01
> ppb0 at pci0 dev 2 function 0 "Intel E7500 PCI" rev 0x01
> pci1 at ppb0 bus 2
> "Intel 82870P2 IOxAPIC" rev 0x04 at pci1 dev 28 function 0 not configured
> ppb1 at pci1 dev 29 function 0 "Intel 82870P2 PCIX-PCIX" rev 0x04
> pci2 at ppb1 bus 5
> "Intel 82870P2 IOxAPIC" rev 0x04 at pci1 dev 30 function 0 not configured
> ppb2 at pci1 dev 31 function 0 "Intel 82870P2 PCIX-PCIX" rev 0x04
> pci3 at ppb2 bus 3
> ppb3 at pci3 dev 3 function 0 "IBM 133 PCIX-PCIX" rev 0x03
> pci4 at ppb3 bus 4
> ami0 at pci4 dev 0 function 0 "Symbios Logic MegaRAID 320" rev 0x02:  
> apic 9 int 0 (irq 10)
> ami0: LSI 532, 32b, FW 414C, BIOS vH429, 128MB RAM
> ami0: 2 channels, 0 FC loops, 3 logical drives
> scsibus0 at ami0: 40 targets, initiator 40
> sd0 at scsibus0 targ 0 lun 0: <AMI, Host drive #00, > SCSI2 0/direct fixed
> sd0: 69618MB, 512 bytes/sec, 142577664 sec total
> sd1 at scsibus0 targ 1 lun 0: <AMI, Host drive #01, > SCSI2 0/direct fixed
> sd1: 69618MB, 512 bytes/sec, 142577664 sec total
> sd2 at scsibus0 targ 2 lun 0: <AMI, Host drive #02, > SCSI2 0/direct fixed
> sd2: 69618MB, 512 bytes/sec, 142577664 sec total
> scsibus1 at ami0: 16 targets, initiator 16
> safte0 at scsibus1 targ 6 lun 0: <SUPER, GEM318, 0> SCSI2 3/processor fixed
> scsibus2 at ami0: 16 targets, initiator 16
> safte1 at scsibus2 targ 6 lun 0: <SUPER, GEM318, 0> SCSI2 3/processor fixed
> ahc0 at pci3 dev 6 function 0 "Adaptec AHA-29160 U160" rev 0x02: apic 9  
> int 4 (irq 10)
> scsibus3 at ahc0: 16 targets, initiator 7
> st0 at scsibus3 targ 6 lun 0: <SEAGATE, DAT 9SP40-000, 910B> SCSI3  
> 1/sequential removable
> uhci0 at pci0 dev 29 function 0 "Intel 82801CA/CAM USB" rev 0x02: apic 8  
> int 16 (irq 10)
> ppb4 at pci0 dev 30 function 0 "Intel 82801BA Hub-to-PCI" rev 0x42
> pci5 at ppb4 bus 1
> fxp0 at pci5 dev 1 function 0 "Intel 8255x" rev 0x10, i82551: apic 8 int  
> 17 (irq 5), address 00:e0:81:26:a9:e4
> inphy0 at fxp0 phy 1: i82555 10/100 PHY, rev. 4
> vga1 at pci5 dev 2 function 0 "ATI Rage XL" rev 0x27
> wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
> wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
> skc0 at pci5 dev 4 function 0 "D-Link Systems DGE-530T A1" rev 0x11,  
> Yukon (0x1): apic 8 int 19 (irq 9)
> sk0 at skc0 port A: address 00:13:46:72:3b:1d
> eephy0 at sk0 phy 0: Marvell 88E1011 Gigabit PHY, rev. 3
> ichpcib0 at pci0 dev 31 function 0 "Intel 82801CA LPC" rev 0x02
> pciide0 at pci0 dev 31 function 1 "Intel 82801CA IDE" rev 0x02: DMA,  
> channel 0 configured to compatibility, channel 1 configured to 
> compatibility
> atapiscsi0 at pciide0 channel 0 drive 0
> scsibus4 at atapiscsi0: 2 targets, initiator 7
> cd0 at scsibus4 targ 0 lun 0: <SONY, DVD RW DW-U18A, UYS4> ATAPI 5/cdrom  
> removable
> cd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
> pciide0: channel 1 disabled (no drives)
> ichiic0 at pci0 dev 31 function 3 "Intel 82801CA/CAM SMBus" rev 0x02:  
> apic 8 int 17 (irq 0)
> iic0 at ichiic0
> lm1 at iic0 addr 0x29: W83782D
> spdmem0 at iic0 addr 0x50: 512MB DDR SDRAM registered ECC PC2100CL2.5
> spdmem1 at iic0 addr 0x51: 512MB DDR SDRAM registered ECC PC2300CL2.5
> spdmem2 at iic0 addr 0x54: 512MB DDR SDRAM registered ECC PC2100CL2.5
> spdmem3 at iic0 addr 0x55: 512MB DDR SDRAM registered ECC PC2100CL2.5
> usb0 at uhci0: USB revision 1.0
> uhub0 at usb0 "Intel UHCI root hub" rev 1.00/1.00 addr 1
> isa0 at ichpcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> com0: console
> com1 at isa0 port 0x2f8/8 irq 3: ns16550a, 16 byte fifo
> pckbc0 at isa0 port 0x60/5
> pckbd0 at pckbc0 (kbd slot)
> pckbc0: using irq 1 for kbd slot
> wskbd0 at pckbd0: console keyboard, using wsdisplay0
> pmsi0 at pckbc0 (aux slot)
> pckbc0: using irq 12 for aux slot
> wsmouse0 at pmsi0 mux 0
> pcppi0 at isa0 port 0x61
> midi0 at pcppi0: <PC speaker>
> spkr0 at pcppi0
> wbsio0 at isa0 port 0x2e/2: W83627HF rev 0x3a
> lm2 at wbsio0 port 0x290/8: W83627HF
> npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
> mtrr: Pentium Pro MTRR support
> softraid0 at root
> root on sd0a swap on sd0b dump on sd0b
