I did have a situation on a SCSI system where one disk got a write error,
and then, during the resync to bring the spare on-line, the system got a
read error on one of the two remaining disks.
I was able to reboot the system, remap the bad blocks, and then do a
"mkraid --really-force" to re-establish the superblocks and resync. After
that, an "e2fsck" showed a couple of bad inodes, which I was able to
restore from backup. All the rest of the data was intact. The only risk in
doing this (as far as I know) is that you need to ensure that /etc/raidtab
matches your current config (failures and raidhotadd's can
confuse things).
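Roughly the sequence, as a sketch (the md device and partition names are
just examples from memory; the bad-block remapping itself was done with
the drive/controller's own tools, so I'm leaving that step out):

    # see what the md driver currently thinks is in the array
    cat /proc/mdstat

    # make sure /etc/raidtab agrees with the above before going further

    # rewrite the RAID superblocks from /etc/raidtab and kick off the
    # resync -- this does not touch the data blocks
    mkraid --really-force /dev/md0

    # once the resync is done, check the filesystem
    e2fsck -f /dev/md0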
I guess it's not very obvious to everybody that mkraid does NOT destroy
data that already exists on the disks as long as you're just "rebuilding"
the array.
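For reference, this is the sort of thing that has to line up (a made-up
raidtab for a six-disk array like Chris's; the device names and chunk size
are only examples -- what matters is that the disk count and the disk
order match the real array):

    raiddev /dev/md0
        raid-level              5
        nr-raid-disks           6
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1
        # ...and so on, up through raid-disk 5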
I think one thing we do need is the ability to generate a raidtab from the
current system config; otherwise, I'm not sure how to verify that it
matches. Isn't this information available in the superblock, so we could
be sure we're not destroying our data?
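Short of a real tool for that, the closest thing I know of is comparing
raidtab against what the kernel itself reports -- a rough check, assuming
/dev/md0 and that your /proc/mdstat output looks roughly like mine does:

    cat /proc/mdstat
    # Personalities : [raid5]
    # md0 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0] ...
    #
    # The number in brackets after each device is its slot in the array;
    # those should match the raid-disk numbers in /etc/raidtab before you
    # trust a "mkraid --really-force".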
________________________________________
Michael D. Black Principal Engineer
[EMAIL PROTECTED] 407-676-2923,x203
http://www.csi.cc Computer Science Innovations
http://www.csi.cc/~mike My home page
FAX 407-676-2355
----- Original Message -----
From: Chris R. Brown <[EMAIL PROTECTED]>
To: Mike Black <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, May 18, 1999 12:02 PM
Subject: Re: RAID-5 Recovery testing
Mike, we know that IDE is not the best choice for servers and RAID,
but we are looking to build a large, cheap archive area that doesn't
need bleeding-edge speed or reliability; we do, however, need protection
against drive failures.
We have taken all reasonable measures to ensure that the IDE masters
and slaves are actually operating independently and can continue that
way if one of the two drives on a channel goes down. (This can be done.)
I don't even care if we need to restart the server when a single drive
goes down. What I can't tolerate is losing data when a single drive
failure TEMPORARILY takes another drive off line.
We will in fact put each IDE drive on its own channel, but our testing
reveals that RAID5 doesn't know when to stop (or maybe how to stop
gracefully). If two drives go out, I would hope that the array would
simply stop rather than corrupt the data or continue operating.
This same scenario could happen with SCSI if there were several
drives on each of two controllers and a controller or cable went down.
I think you'd agree that data corruption is not acceptable under these
conditions.
Comments anyone?
Chris Brown
[EMAIL PROTECTED]
> From: "Mike Black" <[EMAIL PROTECTED]>
> IDE is not a good choice for RAID5 -- if you lose one drive on a pair of
> drives then you can lose both. This is because of the Master/Slave
> relationship -- losing power on one will screw up both drives. Either put
> each drive on a separate IDE bus or upgrade to SCSI. Otherwise you will be
> risking your data.
>
> ________________________________________
> Michael D. Black Principal Engineer
> [EMAIL PROTECTED] 407-676-2923,x203
> http://www.csi.cc Computer Science Innovations
> http://www.csi.cc/~mike My home page
> FAX 407-676-2355
> ----- Original Message -----
> From: Chris R. Brown <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Monday, May 17, 1999 9:08 PM
> Subject: RAID-5 Recovery testing
>
>
> We've experienced a few odd anomalies while testing our IDE RAID-5
> array (6 x 16GB = 80GB).
>
> We started with a good running array and did an e2fsck to ensure its
> integrity...
>
> We simulated a drive failure by disconnecting a drive's power. If
> the IDE channel contained a second drive in the RAID5 array, the array
> was permanently hosed and couldn't be used, even though the
> RAID5 driver would report that it was running OK in degraded mode (5
> of 6 drives) and ALL the remaining drives were functional and could be
> accessed.
>
> We reasoned that this was because the second IDE drive (on the
> channel with the failure) was temporarily off line for a brief instant
> during the "failure". Can anyone confirm these findings, and if so,
> do they imply that elements of a RAID array must be on separate IDE
> channels?
>
> It is our impression that the RAID5 array will not gracefully shut
> down, and will most likely be corrupted, if two drives temporarily fail
> or even go off line at once.
>
> Second, we had several instances where the RAID5 driver
> reported that it was running in degraded mode with four out of six
> drives functioning (Note: this array had no spares) - a seeming
> impossibility, but the array continued to operate. Is this a bug?
> In these cases e2fsck found excessive errors and no data could be
> used.
>
> Third, we tried restarting the array, sometimes switching drives
> around on different channels, and couldn't get all the drives to be
> properly recognized by the RAID5 driver even though we correctly
> updated the /etc/raidtab file. Would turning off the
> persistent-superblock feature help here?
>
>
> Many thanks for any help, suggestions, or comments,
>
> Chris Brown
> [EMAIL PROTECTED]
>
>
>