Ross,

> > difference. The disks are each in removeable containers, which accept
> > only 40-pin connectors. The disks are therefore running at UDMA-33 at
> 
> Get them out now.  I've done a lot of experimenting with RAID5 over IDE
> and these things have been a disaster in most of my testing.  I built
> a RAID5 machine out of IBM Deskstar 18G drives when 18G was bleeding
> edge.

I think we have travelled the same path. This setup replaces a machine
that used RAID-1, -5 and -0 across four 17GB drives. I got the same DMA
errors as below with that machine, but it was quite fast enough without
UDMA, and was otherwise entirely stable. It used the same removable
containers, so you may well be right that they are the source of the
interference forcing IDE to run below its maximum speed, but I doubt
they are what is causing the lockup.
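(For anyone wanting to test this theory without pulling the containers:
a rough sketch of how I'd clamp the interface below UDMA with hdparm,
to see whether the BadCRC errors stop. The device name and mode values
here are examples, not taken from my actual setup.)

```shell
# -X selects the transfer mode: 32 + n = multiword DMA mode n,
# 64 + n = UDMA mode n.  So -X34 forces MDMA2 (no UDMA).
hdparm -d1 -X34 /dev/hda   # keep DMA on, but no UDMA
hdparm -d0 /dev/hda        # or disable DMA entirely (PIO)
hdparm -i /dev/hda         # check which modes the drive reports
```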

If you are suggesting that removable containers should not be used at
all for IDE RAID, then there is a bit of a conundrum here. The point of
RAID is (basically) to allow a machine to keep running when one of its
disks fails. But if your disks are not in removable containers, you
will have to shut down the machine to fix it anyway. I appreciate that
you retain many of the advantages, such as not having to reinstall and
restore from backup, but it is nevertheless far from an ideal solution,
particularly if it is your first disk that fails. I know you can get
round that, but it would be _much_ easier to raidhotadd a new disk
while the machine was still running than to take the machine down,
install the new disk, and then try to get it booting with an empty
first disk.
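(To spell out what I mean by the easy path: with hot-swap containers
the whole replacement is a couple of raidtools commands, roughly as
below. The md device and partition names are hypothetical examples.)

```shell
# Drop the failed member from the array, swap the drive in its
# removable container, then add the replacement and let the kernel
# rebuild onto it -- all without taking the machine down.
raidhotremove /dev/md0 /dev/hdc1
# ...physically swap the drive in its container here...
raidhotadd /dev/md0 /dev/hdc1
cat /proc/mdstat           # watch the reconstruction progress
```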

I find it hard to believe that containers in general are a bad idea.
Maybe you had some dud ones, and maybe mine aren't great either (they
were pretty cheap), but they are fairly ubiquitous in RAID setups.
Does anyone else on the list have experience with IDE removable
containers? In particular, has anyone tried Promise's SuperSwap
enclosures, which look very cool, offering UDMA-100, hot-swap
facilities and a built-in fan? I haven't seen them on offer in the UK,
otherwise I might think about dumping my cheap old units.

> We originally had some drives in these enclosures, but they caused so
> many crashes, we removed them.  It seems that few of them adequately
> pass signals and introduce a lot of cable noise.  I've often seen a
> vendor's drive test program report a drive as failed and unusable when
> placed in a removeable enclosure when the drive would pass all tests
> cleanly when installed direct to a ribbon cable.

As I say, my containers may have caused slowdowns in the past, but they
have never caused crashes to my knowledge. This machine's predecessor
ran for a couple of years with these containers without once crashing.

> > hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> > hdb: DMA disabled
> > ide0: reset: success
> >
> > (I think these errors are a cable issue, but they do not prevent hda
> > working adequately for now)
> 
> I'd be more inclined to indicate the removeable enclosures.  I know
> it's a lot of trouble, but it would be very beneficial to get rid of
> them.

You may be right that they are responsible for these errors, but, unless
they are also responsible for the lockup, I think these errors are less
significant than the inconvenience of doing without containers
altogether. On the other hand, if anyone can recommend good containers
that I can get in the UK, I would be willing to look at replacing them.

> On the other hand these messages can indicate a failing drive.  Usually
> they will display endlessly on the screen for a failed drive, but
> before a drive fails they often show up occasionally, one at a time.

I doubt this is a failing drive. These are pretty new drives, and I
think I've had the messages since I got them.

> Now, I'm not familiar with the RAID code at all, but a glance reveals
> that this 'switching cache buffer size' message should be followed by
> a message reading 'size now <newsize>'.  It looks like the code is
> spinning in shrink_stripe_cache while remove_hash'ing before it gets
> to the printk that would announce this new size.  No, I don't
> know why your setup would break, but might it be instructive for
> debugging purposes to insert some printk's around line 286 in
> raid5.c?  Perhaps if you knew where the kernel was locking there
> would be a better chance that someone familiar with the RAID code
> could see why this was happening.

I'll give it a go, but as I've mentioned in a previous message, Mandrake
8.0 is the first distro that has given me almost insuperable
difficulties in recompiling the kernel. So I'll have to crack that
first, but I guess I need to anyway. I think it's a problem with the
Duron.
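(Once I can build a kernel again, I imagine the instrumentation would
look something like the fragment below -- purely a sketch of the
printk's you suggest, with the function and message names taken from
your analysis rather than checked against the actual 2.4 raid5.c
source, and the loop condition stubbed out.)

```c
/* Illustrative only: markers before, inside and after the suspected
 * loop, so the last line reaching the console shows where the kernel
 * stops making progress. */
printk(KERN_INFO "raid5: shrink_stripe_cache enter\n");
while (0 /* stub: stripes remain to be un-hashed */) {
        printk(KERN_INFO "raid5: removing stripe from hash\n");
        /* remove_hash(...); */
}
printk(KERN_INFO "raid5: shrink_stripe_cache done\n");
```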

At the moment, my linux experience feels like a bad example of entropy,
with everything that used to work fine gradually disintegrating before
my eyes! Besides these problems, my Adaptec 2940-UW is also giving me
problems for the first time in years, because of Mandrake 8.0's use of
the "new" (read "incomplete and buggy") AIC-7xxx drivers. Maybe if we
keep "improving" linux, we can aspire to the reliability standards of MS
Windows. :-( If it weren't for a couple of features (such as swap on
RAID and support for large disks), I'd be tempted to go back to an older
release. But, as it is, I'm trapped.

Yours, in frustration,

Bruno Prior
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]