I am sending this again seeing that my last post never showed up on the
list.
My efforts to set up a software RAID in my "spare" time have progressed a
little from last time around, but it seems that I have run into a rather
significant snag.
I am basing my attempt to build a root software RAID 5 on the Software-RAID
HOWTO (http://www.linuxdoc.org/HOWTO/Software-RAID-HOWTO.html).
Here is a list of the relevant configuration details:
I. Hardware
1. Asus K7V m/b with BIOS version 1.007.
2. In-Win Q500 case w/ 300W p/s and a 3 1/2" fan rigged to the front of
the chassis blowing over the internal 3 1/2" disk cage.
3. 4x IBM DLTA-307045 45 GB drives.
4. Promise Ultra100 controller acting as ide[23].
5. Promise Ultra66 controller acting as ide[45].
6. 256 MB of ECC PC133 RAM w/ ECC enabled.
II. Software configuration
1. The drive on ide2 (/dev/hde) is the one I live off of for the moment.
2. The drives on ide[345] (/dev/hd[gik]) are the test drives.
3. Software RAID 5 is what I am shooting for, so if I don't mention
what RAID level I am using, assume RAID 5.
4. Default disk tuning (unmask IRQs, 32-bit I/O, DMA on, UDMA mode 5 on
the Ultra100 drives and mode 4 on the Ultra66 drives) goes as follows
unless otherwise specified:
hdparm -u1 -c1 -d1 -X69 /dev/hde
hdparm -u1 -c1 -d1 -X69 /dev/hdg
hdparm -u1 -c1 -d1 -X68 /dev/hdi
hdparm -u1 -c1 -d1 -X68 /dev/hdk
5. Stock mdk 7.2 kernel.
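
In case exact versions matter, everything is the stock Mandrake (mdk) 7.2
install; I can post the output of something like the following if it would
help:

    uname -r                           # stock mdk 7.2 kernel version
    rpm -q raidtools hdparm e2fsprogs  # userland tool versions
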
Here is how I first noticed a problem. I was experimenting with a four
disk RAID 5 with one disk marked as failed. (The plan is to add that
disk once everything else is running smoothly, because the last disk to
add is the one I am currently running off of.) I got through all of the
initial steps of building a software RAID and made a copy of my machine
onto the RAID devices using cp -a. When I went to boot off of a boot
disk I noticed that there were errors initializing /dev/md0 and /dev/md1
(/dev/md0 is the / filesystem and /dev/md1 is /home) and the machine
would kernel panic on boot. When I booted the normal way I saw that
/dev/md[01] got errors on the first round of initialization while
/dev/md[234] did not, and that all of the devices then got started in a
second round after the / filesystem was mounted. When I mounted
/dev/md[01] by hand (mount /dev/md0 test or mount /dev/md1 test) I got
errors during the mount, but the devices would mount. /dev/md[24]
(/dev/md3 was set up as swap) gave no errors.
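
For reference, the degraded array was set up along the lines of the
HOWTO's failed-disk trick; the raidtab entry for /dev/md0 looked roughly
like the sketch below (partition numbers and chunk size are illustrative,
from memory):

    raiddev /dev/md0
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32
        device                  /dev/hdg1
        raid-disk               0
        device                  /dev/hdi1
        raid-disk               1
        device                  /dev/hdk1
        raid-disk               2
        # the disk I currently run off of; it gets added back later
        device                  /dev/hde1
        failed-disk             3

    mkraid /dev/md0
    mke2fs /dev/md0
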
This is where I confirmed that there was a real problem. The errors I
was getting seemed to demand further testing, so I unmounted all of the
RAID devices and ran e2fsck on /dev/md[0124]. It came across hundreds or
maybe even thousands of errors of many types, including duplicate
blocks. (I lost count of how many.) After a little playing around I went
back, set hdparm -u0 on all of the disk drives, and tried to recopy the
/ directory. When I recopied, unmounted the device, and reran e2fsck I
got a whole bunch of errors just like before. I then tried running
'badblocks -wv /dev/md0'; after a little while it reported back
approximately 1400 bad blocks on a 6 GB RAID device.
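
For reference, each check cycle amounted to roughly the following (the
exact e2fsck flags are from memory):

    umount /dev/md0
    e2fsck -fv /dev/md0      # turned up the flood of errors, duplicate blocks etc.
    badblocks -wv /dev/md0   # destructive write test; ~1400 bad blocks on ~6 GB
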
This is the more thorough testing I did to try to isolate the problem as
well as possible. After seeing all of those bad blocks show up on a
software RAID device with one disk intentionally failed, I decided to
see what would happen if I destroyed the RAID devices and tested each
disk with one standard Linux partition per disk taking up the whole 45
GB disk. In my first, aborted attempt I noticed that after I started
checking the first disk, the other two would stall; this was fixed by
re-unmasking the IRQs on the drives with hdparm -u1. I then ran
badblocks -wv on /dev/hd[gik]1, one disk per virtual terminal, and let
the three checks run simultaneously. (I noticed that when all of the
disks were being written to, a second or so would go by in which nothing
was flushed to disk, and then a flurry of disk activity lasting a couple
of seconds would pause everything except the blinking cursor. I also
noticed in top that my 800 MHz Athlon was maxed out.) To my surprise the
first disk (/dev/hdg1) finished a few minutes after the other two, but
none of them turned up any errors.
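
(If anyone wants to reproduce the per-disk test, running the three checks
backgrounded with log files amounts to the same thing as my three virtual
terminals; the -o files are only there so the results survive a lockup:)

    badblocks -wv -o hdg1.bad /dev/hdg1 &
    badblocks -wv -o hdi1.bad /dev/hdi1 &
    badblocks -wv -o hdk1.bad /dev/hdk1 &
    wait
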
I then decided to see what would happen if I built a software RAID 5
from the three disks with no failed disks, using the entire disks for a
single RAID device. When I did this and ran 'badblocks -wv /dev/md0' I
came up with a final count of 233,604 bad blocks out of 90,069,632
blocks total. Looking over some of the output, runs of blocks of
seemingly random lengths were marked bad at seemingly random intervals,
and on sequential passes there was no consistency in where the blocks
were marked bad.

When I checked for bad blocks in read-only mode I came up with none.
When I checked using the non-destructive read-write mode ('badblocks -nv
/dev/md0'), the first time around the machine locked up about 15 hours
and 50 minutes into the test, with no indication of blocks being marked
bad before the hang. The second time around it ran the full 16 hours but
didn't mark any blocks as bad. I then converted the array to RAID level
0; when I ran 'badblocks -wv /dev/md0' on it, it found no bad blocks.
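
For comparison, the only thing that changed between the RAID 5 run that
produced the bad blocks and the clean RAID 0 run was the raidtab, roughly
(chunk size again from memory):

    raiddev /dev/md0
        raid-level              0
        nr-raid-disks           3
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              32
        device                  /dev/hdg1
        raid-disk               0
        device                  /dev/hdi1
        raid-disk               1
        device                  /dev/hdk1
        raid-disk               2

    mkraid --really-force /dev/md0   # re-create over the old superblocks
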
If someone can give me input on what to do from here, I would appreciate
it.