On Sat, 2007-10-27 at 19:55 -0400, Doug Ledford wrote:
> On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote:
> > Regardless of the fact that it is not MD's fault, it does make
> > software raid an invalid choice when combined with those drivers. A
> > single disk failure within a RAID5 array bringing a file server down
> > is not a valid option under most situations.
> 
> Without knowing the exact controller you have and driver you use, I
> certainly can't tell the situation.  However, I will note that there are
> times when no matter how well the driver is written, the wrong type of
> drive failure *will* take down the entire machine.  For example, on an
> SPI SCSI bus, a single drive failure that involves a blown terminator
> will cause the electrical signaling on the bus to go dead no matter what
> the driver does to try and work around it.

Sorry, I thought I had copied the list on the info I sent to Richard.
Here are the main hardware combinations.

--- Excerpt Start ----
Certainly. The times when I had good results (i.e. failed drives
with properly degraded arrays) have been with old PATA-based IDE
controllers built into the motherboard and with Highpoint PATA
cards. The failures (i.e. a single disk failure bringing the whole
server down) have been with the following:

* External disks in USB enclosures, both RAID1 and RAID5 (two different
  systems). I don't know the actual controller for these. I assume it is
  related to usb-storage, but I can probably research the actual chipset
  if it is needed.

* Internal ServerWorks PATA controller on a NetEngine server. The
  server is off waiting to get picked up, so I can't get the important
  details.

* Supermicro MB with an ICH5/ICH5R controller and 2 RAID5 arrays of 3
  disks each. (Only one drive in one array went bad.)

* VIA VT6420 built into the MB with RAID1 across 2 SATA drives.

* And the most complex is this week's server with 4 PCI/PCI-X cards.
  But the one that hung the server was a 4-disk RAID5 array on a
  RocketRAID 1540 card.

--- Excerpt End ----

> 
> > I wasn't even asking as to whether or not it should, I was asking if
> > it could.
> 
> It could, but without careful control of timeouts for differing types of
> devices, you could end up making the software raid less reliable instead
> of more reliable overall.

Even if the default timeout were really long (e.g. 1 minute) and then
configurable per device (or per class) via /proc, it would really help.
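
For what it's worth, on the SCSI/SATA side the kernel already exposes a
per-device command timeout through sysfs, so even a small script could
tighten or relax it for just the array members. Something like the sketch
below is roughly what I have in mind (just a sketch: the device names and
the 60-second value are made up, and I'm assuming
/sys/block/<dev>/device/timeout exists for the devices in question, which
it does for SCSI-class disks but not necessarily for everything):

import os
import sys

ARRAY_MEMBERS = ["sda", "sdb", "sdc", "sdd"]   # assumed md member disks
TIMEOUT_SECONDS = 60                           # assumed per-command timeout

def set_timeout(dev, seconds):
    # SCSI/SATA-class devices expose the command timeout here; other
    # device types may not have this attribute at all.
    path = "/sys/block/%s/device/timeout" % dev
    if not os.path.exists(path):
        sys.stderr.write("%s: no timeout attribute, skipping\n" % dev)
        return
    f = open(path, "w")
    try:
        f.write(str(seconds))
    finally:
        f.close()
    sys.stdout.write("%s: command timeout set to %s seconds\n" % (dev, seconds))

for dev in ARRAY_MEMBERS:
    set_timeout(dev, TIMEOUT_SECONDS)

In practice I'd hook something like that into whatever brings the array up,
but it at least shows the per-device knob I'm after.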

> Generally speaking, most modern drivers will work well.  It's easier to
> maintain a list of known bad drivers than known good drivers.

That's what has been so frustrating: the old PATA IDE hardware always
worked, and the new stuff is what has crashed.

> Be careful which hardware raid you choose, as in the past several brands
> have been known to have the exact same problem you are having with
> software raid, so you may not end up buying yourself anything.  (I'm not
> naming names because it's been long enough since I paid attention to
> hardware raid driver issues that the issues I knew of could have been
> solved by now and I don't want to improperly accuse a currently well
> working driver of being broken)

I have settled on 3ware. All my tests showed that it performed quite
well and kicked drives out when needed. Of course, I haven't had a
bad drive on a 3ware production server yet, so... I may end up
pulling out the little bit of hair I have left.

I am now rushing the RocketRAID 2220 into production without testing,
since it was the only thing I could get my hands on. I'll report
any experiences as they happen.

Thanks for all the info,

Alberto

