On Monday November 27, [EMAIL PROTECTED] wrote:
> using reiserfs over raid5 with 5 disks.  This is unnecessarily
> suboptimal, it should be that parity writes are 20% of the disk
> bandwidth.  Comments?
> 
> Is there a known reason why reiserfs over raid5 is way worse than
> ext2.  Does ext2 optimize for raid5 in some way?
> 
> Hans
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]

Having read the 6 or so followups to this question so far, I see there
is a reasonable mixture of information and misinformation floating
around. 
As the author of some of the software raid code in 2.4, and as one
who has looked at it deeply and thinks he understands it all, let me
try to present some useful facts.

1/ raid5 maintains a stripe cache, currently containing 128 stripes,
   each as wide as one filesystem block (not one raid5 chunk!).
   When a read or write request arrives, it is attached to the
   appropriate stripe.  If the needed stripe is not in the cache, a
   free one is allocated.  If there are no free stripes, 1/8 of the
   cache is freed up.
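
   The behaviour described above can be sketched as a toy model (this
   is an illustration in Python, not the kernel's C code; all names
   here are made up):

```python
# Toy model of the raid5 stripe cache: a fixed number of stripes,
# requests attached to their stripe, a free stripe allocated on a
# miss, and 1/8 of the cache freed when no free stripe is available.
# Which eighth gets freed is a simplification here.

CACHE_SIZE = 128

class StripeCache:
    def __init__(self, size=CACHE_SIZE):
        self.size = size
        self.stripes = {}          # stripe sector -> pending requests

    def attach(self, sector, request):
        if sector not in self.stripes:
            if len(self.stripes) >= self.size:
                # No free stripe: free up 1/8 of the cache (here,
                # simply the oldest eighth by insertion order).
                for victim in list(self.stripes)[: self.size // 8]:
                    del self.stripes[victim]
            self.stripes[sector] = []   # allocate a free stripe
        self.stripes[sector].append(request)
```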

2/ raid5 uses one of two methods to write out data and update the
   parity for a stripe, rmw or rcw.
 
   rmw: The Read Modify Write method reads the old parity block and
      the old data blocks for any block that needs to be written,
      calculates the correct parity by subtracting the old data
      blocks from the parity block and adding the new data blocks,
      and then writes out the new data and new parity.  Note that
      some old data may already be in the cache, and so will not
      need to be read.

   rcw: The ReConstruct Write method reads any data for blocks in the
      stripe that do NOT need to be written, and then calculates the
      correct parity from all current data (whether old or new) and
      writes out the new data and the new parity.

   raid5 chooses between these two methods based on how many blocks need
   to be pre-read before the parity calculation can be made.
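
   As a quick illustrative sketch (not the kernel code - raid5 does
   this in C over whole pages), parity is the XOR of the data blocks
   in a stripe, and "subtracting" a block just means XOR-ing it out
   again:

```python
# Sketch of the two parity-update methods over byte strings.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rmw_parity(old_parity, old_blocks, new_blocks):
    """Read Modify Write: XOR out each old block, XOR in its
    replacement; only the written blocks and old parity are needed."""
    p = old_parity
    for old, new in zip(old_blocks, new_blocks):
        p = xor(xor(p, old), new)
    return p

def rcw_parity(all_data_blocks):
    """ReConstruct Write: recompute parity from every data block
    in the stripe, old and new alike."""
    p = bytes(len(all_data_blocks[0]))
    for blk in all_data_blocks:
        p = xor(p, blk)
    return p
```

   The choice between the two then comes down to counting pre-reads:
   rmw needs the old parity plus the old copy of each block being
   written, while rcw needs every block that is not being written.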

3/ In 2.4 (but not in 2.2) access to the stripe cache is overly single
   threaded in the sense that only one request can ever be attached to
   a given stripe at a time.  I gather that this was because the
   author was faced with a need to make the code SMP safe (which was
   not an issue in 2.2 due to the presence of the BKL) and not a lot of
   time to work on it.  The resulting solution works, but is not optimal.

   Blessed with more time, I have a patch which relaxes this restriction
   and substantially improves throughput.  See
    http://www.cse.unsw.edu.au/~neilb/patches/linux/

   It could be precisely this issue that Hans is seeing (if he is
   using 2.4; he didn't say).

4/ As someone mentioned, the only optimisation ext2 has for raid is
   the  "-R stride" option.
   This tells ext2 the stride size of the raid array, meaning the
   minimum stretch of virtual addresses that will span every device in
   the array.  This is (the number of drives minus 1) times the chunk
   size.  I believe that mkfs.ext2 wants this in units of the
   filesystem-block-size, but I'm not sure.

   The effect of this is to position frequently accessed metadata,
   such as allocation bitmaps, at different offsets into the stride, so
   that it will get allocated evenly over all the different devices in
   the array, so that there is no 'hot-drive' for metadata access.
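
   Following the definition above, the arithmetic is simple.  Note
   the hedge about units carries over: whether mkfs.ext2 really wants
   this in filesystem blocks should be checked against its man page.

```python
# Stride as defined above: (number of drives - 1) * chunk size,
# expressed in filesystem blocks.  The units assumption is from the
# text and is not guaranteed.

def stride_in_fs_blocks(n_drives, chunk_size, fs_block_size):
    return (n_drives - 1) * chunk_size // fs_block_size

# e.g. a 5-disk raid5 with 64 KiB chunks and 4 KiB ext2 blocks:
# (5 - 1) * 65536 / 4096 = 64 filesystem blocks.
```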

5/ The best way to write to a raid5 array is by writing large
   contiguous stretches of data.  This should allow the raid5 array to
   collect blocks into stripes and write a full stripe at a time,
   which will not require any pre-reading.  However it is not easy to
   do this in Linux. 

   The "Best" interface for efficiently writing data is to have an
   asynchronous write, and a synchronous flush.
   The "write" says "write this whenever you like", and the "flush"
   says "Ok, I want this written NOW".
   Actually, it is possibly even better to have three stages:
     1/ this is ready to be written
     2/ write this now please
     3/ don't return until this is written.

   The Unix system call interface has "write" with a very broad
   "flush" - you can flush a whole file, but I don't think that you
   can flush individual byte ranges (I could be wrong).  It does
   fairly well.
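
   A rough user-space analogue of the three stages exists in ordinary
   buffered I/O (this illustrates the idea only; it is not a kernel
   or block-device interface):

```python
import os

# Mapping the three stages onto Python's buffered file I/O:
#   1/ f.write()  - "this is ready to be written" (sits in a buffer)
#   2/ f.flush()  - "write this now please" (handed to the kernel)
#   3/ os.fsync() - "don't return until this is written" (on disk)

def three_stage_write(path, data):
    with open(path, "wb") as f:
        f.write(data)          # stage 1: buffered, nothing submitted yet
        f.flush()              # stage 2: submitted to the OS
        os.fsync(f.fileno())   # stage 3: forced to stable storage
```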

   The NFS network filesystem, in version two, only has synchronous
   writes.  This makes writing a real bottleneck.  In version three,
   asynchronous writes were introduced together with a "COMMIT" which
   would flush a given byte range.  This makes writing of large files
   much more efficient.

   The Linux block device layer only has one flavour of write
   request.  It is not exactly a synchronous write, as the caller can
   choose to wait or not.  But the device driver has (almost) no way
   of knowing whether the caller is waiting or not.

   This makes it hard to collect blocks together into a stripe.  When
   raid5 gets a write request, it really needs to start acting on it
   straight away, which means scheduling a read of the parity block
   and the old data.  While it is waiting for the read to complete, it
   may get some more writes attached to the stripe, and may even get a
   full stripe, but it cannot continue until the reads complete, and
   the reads may well have been wasted time.  For larger numbers of
   drives, there could be less wastage, but there is still wastage.

   I included an (almost) above.  This is because there is a fairly
   coarse method for drives to discover that a writer (or reader) is
   now waiting for a response.  This is called "plugging".
   A device may choose to "plug" itself when it gets an I/O request.
   This causes the request to be queued, but the queue doesn't get
   processed.  Thus subsequent requests can be merged on the queue.
   When the device gets unplugged, this smaller number of merged
   requests gets dealt with more efficiently.

   However, there is only one "unplug-all-devices"(*) call in the API
   that a reader or writer can make.  It is not possible to unplug a
   particular device, or better still, to unplug a particular request.
   
   This works fairly well when doing IO on a single device - say an
   IDE drive, but when doing I/O on a raid array, which involves a
   number of devices, there will be a lot of unplug-all-devices calls,
   and plugging cannot be as effective.
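
   The plugging idea can be sketched as a queue that accepts and
   merges requests without processing them until unplugged (a toy
   model, not the 2.4 block layer; requests here are just
   (start_sector, length) pairs):

```python
# Toy model of "plugging": while plugged, requests queue up and
# contiguous ones are merged; unplug() releases the (smaller) merged
# list for processing in one go.

class PluggedQueue:
    def __init__(self):
        self.pending = []
        self.plugged = False

    def submit(self, start, length):
        if not self.plugged:
            self.plugged = True      # plug on the first request
        # merge with the previous request if contiguous
        if self.pending and sum(self.pending[-1]) == start:
            s, l = self.pending[-1]
            self.pending[-1] = (s, l + length)
        else:
            self.pending.append((start, length))

    def unplug(self):
        done, self.pending, self.plugged = self.pending, [], False
        return done                  # hand back the merged requests
```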

   I have some patches which add plugging for raid5 writes, and it
   DRAMATICALLY improves sequential write throughput on a 3 to 5 drive
   raid5 array with a 4k chunk size.  With other configurations there
   is an improvement, but it is not so dramatic.  There are various
   reasons for this, but I believe that part of the reason is the
   extra noise of unplug-all-devices calls.  I haven't explored this
   very thoroughly yet.

   So, in short, you can do better than the current code (see my
   patches) but the Linux block-device API gets in the way a bit.

   In 2.2, a different approach was possible.  As all filesystem data
   was in the buffer cache, which was physically addressed, the raid5
   code could, when preparing to write, look in the buffer cache for
   other blocks in the same stripe which were marked dirty, and
   proactively write them (even though no write request had been
   issued).  This improved performance substantially for 2.2 raid5
   writing.  However is it not possible in 2.4 becuase filesystem
   data is, by and large, not in the buffer cache - it is in the page
   cage.

  (*) The unplug-all-devices call is spelt:
          run_task_queue(&tq_disk);

6/ With reference to Hans' question in a follow-up:

  Is the following statement correct?  Unless we write the whole stripe
  next to instead of over the current data, we cannot guarantee
  recoverability upon removal of a disk drive while the FS is in
  operation, and this is likely to be much of the motivation for the
  NetApp WAFL design as they gather writes into stripes (I think this
  last is true, but not sure).

  RAID5 cannot survive an unclean shutdown with a failed drive.  This
  is because a stripe may have been partially written at the point of
  unclean shutdown, so reconstructing the missing block from the
  remaining drives will likely produce garbage.
  However, apart from that, there are no problems with losing drives
  while the FS is in operation.

  There are (at least) two effective responses to this problem:

   1/ use NVRAM somewhere in the system so that you can effectively do
      a two-stage commit - commit data to NVRAM, then write that
      data to the array, and then release the data from NVRAM.
     After an unclean shutdown, you re-write all data in NVRAM, and
     you are safe.

      Of course, the NVRAM could be replaced by any logging device, such
      as a separate mirrored pair of drives, but there could be a
      performance cost in that.
 
     This is a part of the NETAPP solution I believe.

  2/ Use a filesystem that 
       - knows about the raid stripe size, and 
       - only ever writes full stripes, and
       - does so to stripes which didn't previously contain live
         data, and
       - knows which stripes it has written recently (even after an
         unclean shutdown) and
       - can tell if a stripe was written correctly or not.  

    Such a filesystem could, on restart, read and re-write all stripes
    which could have been written recently (since last sync), and so
    ensure correct parity for all valid data.  A log structured
    filesystem is ideal for this task, and writing one is on my todo
    list - though it is a rather large item :-)

    My understanding of NETAPP's WAFL is definitely incomplete, but I
    don't believe that they guarantee to always do stripe-wide writes,
    though they certainly try to encourage it.
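
    Response 1/ above can be sketched as follows (a toy model with a
    dictionary standing in for both the NVRAM log and the array; all
    names are illustrative):

```python
# Two-stage commit through a log: commit to "NVRAM" first, write to
# the array, then release from the log.  After an unclean shutdown,
# replaying the log re-writes anything that may have been mid-stripe.

class LoggedArray:
    def __init__(self):
        self.log = {}      # stand-in for NVRAM
        self.array = {}    # stand-in for the raid5 array

    def write(self, addr, data):
        self.log[addr] = data        # stage 1: commit to NVRAM
        self.array[addr] = data      # stage 2: write to the array
        del self.log[addr]           # stage 3: release from NVRAM

    def recover(self):
        # After an unclean shutdown: re-write everything still in the
        # log, which restores consistent parity for those stripes.
        for addr, data in self.log.items():
            self.array[addr] = data
        self.log.clear()
```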


I hope this helps.

NeilBrown