First of all, I've been informed who Greg Oster is...a/the maintainer of
RAIDframe.  So, let's start by acknowledging his superior knowledge in
the area (possibly a little biased, but his knowledge of this topic is to
be respected).

I am NOT a file system expert.  I am barely file system aware.  Some
readers of my posts might mistake my knowledge of the OpenBSD boot
process and disk layout for file-system knowledge.  That would be a big
error -- very different topics!

Greg Oster wrote:
> Nick Holland writes:
>> Greg Oster wrote:
...
>> > 2) Start extracting 5 copies of src.tar.gz onto the filesystem (
>> > simultaneously is preferred, but basically anything that will generate 
>> > a lot of IO here is what is needed).
>> 
>> I wussed out here.  Did one unpacking of a Maildir in a .tgz file.  But
>> lots of IO, lots of thrashing, disks were basically saturated with work,
>> processor was waiting for disk.  Lots of tiny files.  On the other hand,
>> that's a lot more activity than this machine will ever see in production.
> 
> Um... that's just one thread of IO... 64K (or whatever MAXPHYS is) 
> presented, in sequence, to the underlying driver.  A rather boring 
> sequence of IO, with not much chance for one disk to get ahead or 
> behind the other in terms of servicing requests.
[snip a convincing "you are WRONG" argument :) ]

ok, let's try this again, then.

...
>> > 3) After that's been going for a while, and while still in progress, 
>> > pull the power from the machine.
>> 
>> Drop power mid write, you are risking your disk.  Yes, I have spiked
>> disks with a nail gun to test RAID in the past, but didn't feel like
>> possibly toasting two disks by powering down the machine mid-write at
>> this time.  This system has purpose for me. :)
> 
> Heh.. my RAID test box has a disk in external case.. disk 'failure' 
> is simulated by powering off that case... I don't know how many power 
> outages that poor little disk has seen :) 

Need to borrow a powder-actuated nail gun? :)
Nothing tests disk failure like a nail through a platter.  While it is
spinning.  It's fun, too! :)

I'm a definite RAID skeptic.  Testing stuff is a good thing.  I'm
learning all kinds of good stuff here. :)

Yes, wuss factor was acknowledged on my previous test.  This time,
however, I went for the power cord.  These are only 4G disks...if I end
up toasting them, it's far from the end of the world.  While I can give
you a very good explanation (or several) for why a disk powered down
mid-write could be damaged, it is really odd how RARELY this actually
happens in real life.  I come from the era when they told you to open
the floppy doors before powering down the machine and close them AFTER
powering it back up.
...
>> > 6) Do an md5 checksum of each of the parts of the mirror, and see if 
>> > they differ.  (they shouldn't, but I bet they do!!)
>> 
>> I think the md5 test of the mirror elements is bogus here.
>> I don't care if an unallocated block is different. I care if the files
>> are different.  I might not even care about that much.  See below...
> 
> Umm.... There is still a non-zero chance that metadata on one disk 
> will be different than metadata on the other, or that data on one 
> disk will be different than the other...

I'll agree to that ('specially following later results).  But I do not
see the point of getting excited about a difference in non-allocated
data.  My test is lame, yours is too strict.  I can't think of a test
that is "just right". :-/
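One middle-ground test that occurs to me: checksum only the allocated
file data on each half and compare the lists, so unallocated blocks drop
out of the picture.  A rough sketch -- the mount points in the comment
are just the ones from my tests, and the function name is made up:

```shell
#!/bin/sh
# Middle-ground comparison sketch: checksum every file on each mirror
# half and compare the two lists.  Differences in unallocated blocks
# are ignored; every byte of allocated file data is checked.
compare_halves() {
    ( cd "$1" && find . -type f | sort | xargs cksum ) > /tmp/half1.$$
    ( cd "$2" && find . -type f | sort | xargs cksum ) > /tmp/half2.$$
    if cmp -s /tmp/half1.$$ /tmp/half2.$$; then
        echo "file data identical"
    else
        echo "mirror halves differ"
    fi
    rm -f /tmp/half1.$$ /tmp/half2.$$
}

# e.g., with one half mounted normally and the other on /mnt:
# compare_halves /var/test /mnt/test
```

Still not airtight, mind you -- it ignores permissions, timestamps, and
directory metadata -- but it's stricter than diff'ing files and looser
than md5'ing raw partitions.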
...

> Your results here might lead us to wonder why RAID systems all worry 
> about keeping the mirrors in sync.. just think of all the cycles that 
> could be saved if they didn't bother!!  ;) 

Actually, that occurred to me, yes.
HOWEVER, I wish to point out (again) that I am NOT a file system expert.
Every OS and most HW-based systems seem to compulsively rebuild
mirrors.  I think it is best to assume they know something I don't. :)

I know of two HW RAID systems which aren't so compulsive: Both the
Accusys and the Arco IDE mirroring boxes seem to be indifferent to
powerdowns (they should be indifferent to crashes, as they'll just
finish the last writes without the OS's help).  Come to think of it, the
"cheapie" BIOS-assisted SW-RAID cards I've played with on Windows seem
to do the same thing -- I'm guessing they just don't try to optimize
writes, so it "is write to disk 0, write the same thing to disk 1, and
don't let anything else happen in between."  I've heard they don't
perform as well as some of the "pure software mirroring" solutions, so
that may be evidence of this.

>>  My three tests indicated one can't universally even
>> demonstrate a difference in the written files, though I'd want to repeat
>> it an infinite number more times before I say "and there never will be a
>> difference". :)
> 
> If you have time, I'd try the test as originally outlined, or as 
> modified to have reading (or, better yet, heavy reading) being done 
> from the ccd mirror...  As "interesting" as your results are, they 
> a) don't surprise me nor b) have much to do with the test in question.

Yes, I think this was worth doing.

>> Yes, ccd(4) mirroring is not for every application.  But for some, it
>> can be useful.  My above mentioned DNS/DHCP server is an example -- I'd
>> like to keep two copies of constantly changing data.  If I lose one, I'd
>> like to have rapid repair.  If I lose them both, it will not be the end
>> of the world. 
> 
> I don't have a problem with people using ccd mirroring for data they 
> don't care about...  I do have a problem when they haven't fully 
> understood the implications, and believe it is doing something that 
> it isn't! 

Yes.  I agree with you whole-heartedly on this.  I've been working on a
ccd(4) mirroring FAQ entry for a few months.  It will have some pretty
big disclaimers -- bigger now that I have verified at least some of your
concerns in part.  It also has some pretty big disclaimers about RAID in
general.  My experience has been that most people are idiots about how
they implement any form of RAID (most notably, assuming some magic will
happen in the recovery process).

...
Let's get to the results of my second and third sets of tests...

First, I did eight untarrings of src.tar.gz (from one file on a
non-mirrored partition to eight different destinations).  As it was
running, I realized I had forgotten to delete the Maildir I had (partly)
unpacked before, so I launched an "rm -r" on that one.
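In script form, that load looked something like this.  The function
name is just for illustration, and the tarball path is a placeholder
(mine sat on a non-mirrored partition):

```shell
#!/bin/sh
# Eight simultaneous extractions of one tarball into DEST/1..DEST/8 --
# roughly the load described above (plus the stray "rm -r" on the old
# Maildir, not shown).  parallel_untar is a made-up name for this sketch.
parallel_untar() {
    tarball=$1; dest=$2
    for i in 1 2 3 4 5 6 7 8; do
        mkdir -p "$dest/$i"
        ( cd "$dest/$i" && tar xzf "$tarball" ) &
    done
    wait    # let all eight extractions finish
}

# e.g.: parallel_untar /path/to/src.tar.gz /home/test
```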

This rather anemic machine was pretty much unusable by this point.  I'll
need some better hardware before I can get much uglier than this. :)
This machine has a pair of 4G IDE drives, 64M RAM, and a Celeron 333.
64M of RAM should mean not much was cached...much more than that was
written to disk, though since I used only one copy of src.tar.gz, some
caching was probably taking place there.

Initial comparison produced some weird results...massive numbers of "No
such file or directory" messages...until I realized the src.tar.gz file
I used contained symlinks to non-existent things in the obj directory.
So..yeah.  Expected.  But they would also mask other errors, so I deleted
them from both test file systems.  Here were my results after that:


# diff -ur /home/test /mnt/test
Only in /mnt/test/1/gnu/egcs/gcc: cp
Only in /mnt/test/2/gnu/egcs/gcc: cccp.1
Only in /mnt/test/2/gnu/egcs/gcc: cccp.c
Only in /mnt/test/2/gnu/egcs/gcc: cexp.y
Only in /mnt/test/2/gnu/egcs/gcc: collect2.c
# diff -ur /home/Maildir/ /mnt/Maildir/

#

So, we DID have errors due to different content on the two disks on the
untar'ing, none on the rm'ing.

I hate ambiguous failures; I'd much rather have a spectacular failure.
So, I repeated your tests, following a little closer to your guidelines,
and your revised guidelines.

I only had space for five copies of src.tar.gz on the smaller ccd(4)
mirrored partition (and even then, I was at 104% utilization!), so I
only got five simultaneous reads going here.

So here's the plan:
  500M /home partition
  1G   /var  partition
on ccd mirroring.
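For reference, a mirrored ccd pair like this might be set up along these
lines.  The interleave, flag value, and device names are assumptions
from my hardware, so check ccdconfig(8) before trusting any of it:

```shell
# /etc/ccd.conf -- interleave and partition letters are guesses for
# this box; flag 4 is the mirroring flag (CCDF_MIRROR)
# ccd#  ileave  flags  component devices
ccd0    16      4      /dev/wd0d /dev/wd1d    # -> 500M /home
ccd1    16      4      /dev/wd0e /dev/wd1e    # -> 1G   /var

# then:
#   ccdconfig -C    # configure everything listed in /etc/ccd.conf
#   (disklabel and newfs the ccd devices as usual)
```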

# ls /home
src.tar.gz   src1.tar.gz  src2.tar.gz  src3.tar.gz  src4.tar.gz

Run the following script:
--------
#!/bin/sh
# Each "cd N && tar ..." list is backgrounded as a whole, so it runs
# in its own subshell; the cds don't step on one another.

mkdir /var/test
cd /var/test
mkdir 1 2 3 4 5

cd 1 && tar xzf /home/src.tar.gz &
cd 2 && tar xzf /home/src1.tar.gz &
cd 3 && tar xzf /home/src2.tar.gz &
cd 4 && tar xzf /home/src3.tar.gz &
cd 5 && tar xzf /home/src4.tar.gz &
--------

Wait for /var to get to around 70% full (starting from 1% full)...
(*thrash*thrash*thrash*)
Yank the cord when df shows /var at 70%...takes a while, this thing is
not fast.
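The "watch df until 70%" step could be automated, for anyone repeating
this.  A sketch -- the mount point and threshold are from this test, the
function name is made up, and the df column parsing is an assumption
about its output format:

```shell
#!/bin/sh
# Poll df until the given mount point reaches the given percent full,
# then say so.  Uses df -P so each filesystem stays on one line;
# capacity is assumed to be column 5, e.g. "70%".
watch_fill() {
    mnt=$1; limit=$2
    while :; do
        pct=$(df -P "$mnt" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
        [ "$pct" -ge "$limit" ] && break
        sleep 5
    done
    echo "$mnt is ${pct}% full -- pull the plug"
}

# e.g.: watch_fill /var 70
```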

Reboot (still mirroring).  fsck runs, lots of errors.

Get rid of the various obj symlinks:
# cd /var/test
# find . -type l -name obj | xargs rm

Split the mirror, mount the second half of /var on /mnt
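For completeness, the split was along these lines.  Device names and
partition letters are guesses for this particular box, and this is a
sketch, not a recipe:

```shell
# Stop using the mirrored /var, tear down the ccd, then mount the two
# components separately for comparison.  All device names are
# assumptions about this setup.
umount /var
ccdconfig -u ccd1           # unconfigure the /var mirror
mount /dev/wd0e /var        # first half back in its usual place
mount -r /dev/wd1e /mnt     # second half, read-only, for the diff
```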

# diff -ur /var/test /mnt/test
#

Ummmmmmm...no errors?
That wasn't what I expected.

These ARE rather old IDE drives on an old IDE interface....I suspect
newer drives and interfaces, or SCSI drives with better support for
concurrent disk activity, might produce more spectacular failures.

Nick.
