Neil Brown wrote:
>
> On Monday November 27, [EMAIL PROTECTED] wrote:
> > using reiserfs over raid5 with 5 disks. This is unnecessarily
> > suboptimal, it should be that parity writes are 20% of the disk
> > bandwidth. Comments?
> >
> > Is there a known reason why reiserfs over raid5 is way worse than
> > ext2? Does ext2 optimize for raid5 in some way?
> >
> > Hans
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to [EMAIL PROTECTED]
>
> Having read the 6 or so followups to this question so far, I see there
> is a reasonable mixture of information and misinformation floating
> around.
> As the author of some of the software raid code in 2.4, and as one
> who has looked at it deeply and believes he understands it all, let
> me try to present some useful facts.
>
> 1/ raid5 maintains a stripe cache, currently containing 128 stripes,
> each as wide as one filesystem block (not one raid5 chunk!).
> When a read or write request arrives, it is attached to the
> appropriate stripe. If the needed stripe is not in the cache, a
> free one is allocated. If there are no free stripes, 1/8 of the
> cache is freed up.
>
> 2/ raid5 uses one of two methods to write out data and update the
> parity for a stripe, rmw or rcw.
>
> rmw: The Read Modify Write method reads the old parity block and
> the old data blocks for any block that needs to be written,
> calculates the correct parity by subtracting the old data
> blocks from, and adding the new data blocks to, the parity
> block, and then writes out the new data and new parity. Note
> that some old data may already be in the cache, and so will not
> need to be read.
>
> rcw: The ReConstruct Write method reads any data for blocks in the
> stripe that do NOT need to be written, and then calculates the
> correct parity from all current data (whether old or new) and
> writes out the new data and the new parity.
>
> It chooses between these two methods based on how many blocks need
> to be pre-read before the parity calculation can be made.
>
> 3/ In 2.4 (but not in 2.2) access to the stripe cache is overly single
> threaded in the sense that only one request can ever be attached to
> a given stripe at a time. I gather that this was because the
> author was faced with a need to make the code SMP safe (which was
> not an issue in 2.2 due to the presence of the BKL) and not a lot of
> time to work on it. The resulting solution works, but is not optimal.
>
> Blessed with more time, I have a patch which relaxes this restriction
> and substantially improves throughput. See
> http://www.cse.unsw.edu.au/~neilb/patches/linux/
>
> It could be precisely this issue that Hans is seeing (if he is
> using 2.4. He didn't say).
>
> 4/ As someone mentioned, the only optimisation ext2 has for raid is
> the "-R stride" option.
> This tells ext2 the stride size of the raid array, meaning the
> minimum stretch of virtual addresses that will span every device in
> the array. This is (the number of drives minus 1) times the chunk
> size. I believe that mkfs.ext2 wants this in units of the
> filesystem-block-size, but I'm not sure.
>
> The effect of this is to position frequently accessed metadata,
> such as allocation bitmaps, at different offsets into the stride,
> so that it is spread evenly over all the devices in the array and
> there is no 'hot-drive' for metadata access.
>
> 5/ The best way to write to a raid5 array is by writing large
> contiguous stretches of data. This should allow the raid5 array to
> collect blocks into stripes and write a full stripe at a time,
> which will not require any pre-reading. However it is not easy to
> do this in Linux.
>
> The "Best" interface for efficiently writing data is to have an
> asynchronous write, and a synchronous flush.
> The "write" says "write this whenever you like", and the "flush"
> says "Ok, I want this written NOW".
> Actually, it is possibly even better to have three stages:
> 1/ this is ready to be written
> 2/ write this now please
> 3/ don't return until this is written.
>
> The Unix system call interface has "write" with a very broad
> "flush" - you can flush a whole file, but I don't think that you
> can flush individual byte ranges (I could be wrong). It does
> fairly well.
>
> The NFS network filesystem, in version two, only has synchronous
> writes. This makes writing a real bottleneck. In version three,
> asynchronous writes were introduced together with a "COMMIT" which
> would flush a given byte range. This makes writing of large files
> much more efficient.
>
> The Linux block device layer only has one flavour of write
> request. It is not exactly a synchronous write, as the caller can
> choose to wait or not. But the device driver has (almost) no way
> of knowing whether the caller is waiting or not.
>
> This makes it hard to collect blocks together into a stripe. When
> raid5 gets a write request, it really needs to start acting on it
> straight away, which means scheduling a read of the parity block
> and the old data. While it is waiting for the read to complete, it
> may get some more writes attached to the stripe, and may even get a
> full stripe, but it cannot continue until the reads complete, and
> the reads may well have been wasted time. For larger numbers of
> drives, there could be less wastage, but there is still wastage.
>
> I included an (almost) above. This is because there is a fairly
> coarse method for drives to discover that a writer (or reader) is
> now waiting for a response. This is called "plugging".
> A device may choose to "plug" itself when it gets an I/O request.
> This causes the request to be queued, but the queue doesn't get
> processed. Thus subsequent requests can be merged on the queue.
> When the device gets unplugged, this smaller number of merged
> requests gets dealt with more efficiently.
>
> However, there is only one "unplug-all-devices"(*) call in the API
> that a reader or writer can make. It is not possible to unplug a
> particular device, or better still, to unplug a particular request.
>
> This works fairly well when doing IO on a single device - say an
> IDE drive, but when doing I/O on a raid array, which involves a
> number of devices, there will be a lot of unplug-all-devices calls,
> and plugging cannot be as effective.
>
> I have some patches which add plugging for raid5 writes, and it
> DRAMATICALLY improves sequential write throughput on a 3 to 5 drive
> raid5 array with a 4k chunk size. With other configurations there
> is an improvement, but it is not so dramatic. There are various
> reasons for this, but I believe that part of the reason is the
> extra noise of unplug-all-device calls. I haven't explored this
> very thoroughly yet.
>
> So, in short, you can do better than the current code (see my
> patches) but the Linux block-device API gets in the way a bit.
>
> In 2.2, a different approach was possible. As all filesystem data
> was in the buffer cache, which was physically addressed, the raid5
> code could, when preparing to write, look in the buffer cache for
> other blocks in the same stripe which were marked dirty, and
> proactively write them (even though no write request had been
> issued). This improved performance substantially for 2.2 raid5
> writing. However it is not possible in 2.4 because filesystem
> data is, by and large, not in the buffer cache - it is in the page
> cache.
>
> (*) The unplug-all-devices call is spelt:
> run_task_queue(&tq_disk);
>
> 6/ With reference to Hans' question in a follow-up:
>
> Is the following statement correct? Unless we write the whole stripe
> next to, rather than over, the current data, we cannot guarantee
> recoverability upon removal of a disk drive while the FS is in
> operation, and this is likely to be much of the motivation for the
> NetApp WAFL design as they gather writes into stripes (I think this
> last is true, but not sure).
>
> RAID5 cannot survive an unclean shutdown with a failed drive. This
> is because a stripe may have been partially written at the point of
> unclean shutdown, so reconstructing the missing block from the
> remaining drives will likely produce garbage.
> However, apart from that, there are no problems with losing drives
> while the FS is in operation.
>
> There are (at least) two effective responses to this problem:
>
> 1/ use NVRAM somewhere in the system so that you can effectively do
> a two stage commit - commit data to NVRAM, then write that
> data to the array, and then release the data from NVRAM.
> After an unclean shutdown, you re-write all data in NVRAM, and
> you are safe.
>
> Of course, the NVRAM could be replaced by any logging device, such
> as a separate mirrored pair of drives, but there could be a
> performance cost in that.
>
> This is a part of the NETAPP solution I believe.
>
> 2/ Use a filesystem that
> - knows about the raid stripe size, and
> - only ever writes full stripes, and
We should work with you to get reiserfs to be able to do the above, yes?
> - does so to stripes which didn't previously contain live
> data, and
> - knows which stripes it has written recently (even after an
> unclean shutdown) and
> - can tell if a stripe was written correctly or not.
These items require writing wandering logs, which are not likely to
happen before June, and probably come at a performance penalty, but we
should do them also.
Hans