On Thu, 14 Oct 1999, Christian Reis wrote:
> I have a proto-production Root RAID1 array set up, and it's mostly working
> fine. I did have quite some trouble getting the root md set up and
> booting, but now that it's done, it's stable.
>
> I usually run a set of tests to see how well my R1 is working - 'hot'
> pulling the drives' data cables - perhaps not the smartest, but the closest I
> can get to an actual failure. On the IDE systems I've used, the chipset
> usually complains a lot but the system goes on being quite usable - even
> without a raidhotremove.
>
> On this Adaptec 2940 box, however, the SCSI subsystem complains awfully
> when I hot-disconnect a drive. The system slows to a crawl, and
> makes me wonder what a disk failure will do to it. If I raidhotremove the
> partitions the system goes back to normal usability, but I then get a nice
> kernel oops after a while.

Hold it, stop. SCSI buses themselves are not designed to recover well
from the kind of torture you're subjecting this one to. Most (99.99+%)
hard drive failures are graceful and do not result in a hung/locked SCSI
bus. The drive simply stops responding to data requests and sits there
quietly. When a drive does truly fail in such a way as to hang the bus,
the bus usually resets after a timeout, but every now and then a drive fails
with a data pin or something held to ground, and the whole bus grinds to a
halt.

Here's some hardware you might want to consider for better recoverability
from a drive failure:

First, hot-swappable caddies. We use some nice metal ones (metal transfers
heat better) with an individual fan for each caddy. The drives stay cool, and
the ID can be set from the receiver instead of the caddy, so customers can
change out a failed drive without having to find their "7 level"
screwdriver. These remove the drive from the SCSI bus gracefully.
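
Secondly, a card with dual buses is more reliable than a single bus, and
two separate cards are safer than either of those, provided the two halves
of the mirror sit on different buses, so that a hung bus takes out only one
of them.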

I admit, though, that the RAID software needs some work too, but your testing
methodology is a bit harsh compared to most real-world failures.
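
Whatever way you end up simulating a failure, it helps to check what the md
layer itself thinks happened afterwards. Below is a minimal sketch in Python,
not a tested tool. It assumes your kernel's /proc/mdstat uses the layout I
have in mind: an "mdN : ..." header line listing the members, a "(F)" suffix
on members marked faulty, and a status map such as [UU] or [U_] on a
following line. The function name degraded_arrays and the sample text are
made up for illustration, so compare them against what your own box prints.

```python
import re

# Hypothetical sample of /proc/mdstat output; the exact layout is an
# assumption and may differ between kernel versions.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      1048512 blocks [2/1] [U_]
md1 : active raid1 sdb2[1] sda2[0]
      2097088 blocks [2/2] [UU]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded array name to the members flagged (F) on its header line."""
    result = {}
    current = None
    faulty = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Faulty members are assumed to look like "sdb1[1](F)".
            faulty = re.findall(r"(\S+)\[\d+\]\(F\)", header.group(2))
            continue
        # The status map holds one character per slot: U = up, _ = missing.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                result[current] = faulty
            current = None
    return result


if __name__ == "__main__":
    # On a live system you would pass open("/proc/mdstat").read() instead.
    print(degraded_arrays(SAMPLE))
```

Run it before and after a test and compare the two results; an array that
shows up only afterwards is one the md layer has noticed as degraded.
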
Scott Marlowe