Neil Brown wrote:
>
> On Monday November 27, [EMAIL PROTECTED] wrote:
> > using reiserfs over raid5 with 5 disks. This is unnecessarily
> > suboptimal, it should be that parity writes are 20% of the disk
> > bandwidth. Comments?
> >
> > Is there a known reason why reiserfs over raid5 is way worse than
> > ext2? Does ext2 optimize for raid5 in some way?
> >
> > Hans
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to [EMAIL PROTECTED]
>
> Having read the 6 or so followups to this question so far, I see there
> is a reasonable mixture of information and misinformation floating
> around.
> As the author of some of the software raid code in 2.4, and as one
> who has looked at it deeply and believes he understands it all, let
> me try to present some useful facts.
>
> 1/ raid5 maintains a stripe cache, currently containing 128 stripes,
> each as wide as one filesystem block (not one raid5 chunk!).
> When a read or write request arrives, it is attached to the
> appropriate stripe. If the needed stripe is not in the cache, a
> free one is allocated. If there are no free stripes, 1/8 of the
> cache is freed up.
>
> 2/ raid5 uses one of two methods to write out data and update the
> parity for a stripe, rmw or rcw.
>
> rmw: The Read Modify Write method reads the old parity block and
> the old data blocks for any block that needs to be written,
> calculates the correct parity by subtracting the old data
> blocks from, and adding the new data blocks to, the parity
> block, and then writes out the new data and new parity. Note
> that some old data may already be in the cache, and so will not
> need to be read.
>
> rcw: The ReConstruct Write method reads any data for blocks in the
> stripe that do NOT need to be written, and then calculates the
> correct parity from all current data (whether old or new) and
> writes out the new data and the new parity.
>
> It chooses between these two methods based on how many blocks need
> to be pre-read before the parity calculation can be made.
>
> 3/ In 2.4 (but not in 2.2) access to the stripe cache is overly single
> threaded in the sense that only one request can ever be attached to
> a given stripe at a time. I gather that this was because the
> author was faced with a need to make the code SMP safe (which was
> not an issue in 2.2 due to the presence of the BKL) and not a lot of
> time to work on it. The resulting solution works, but is not optimal.
>
> Blessed with more time, I have a patch which relaxes this restriction
> and substantially improves throughput. See
> http://www.cse.unsw.edu.au/~neilb/patches/linux/
>
> It could be precisely this issue that Hans is seeing (if he is
> using 2.4. He didn't say).
>
> 4/ As someone mentioned, the only optimisation ext2 has for raid is
> the "-R stride" option.
> This tells ext2 the stride size of the raid array, meaning the
> minimum stretch of virtual addresses that will span every device in
> the array. This is (the number of drives minus 1) times the chunk
> size. I believe that mkfs.ext2 wants this in units of the
> filesystem-block-size, but I'm not sure.
>
> The effect of this is to position frequently accessed metadata,
> such as allocation bitmaps, at different offsets into the stride,
> so that it is spread evenly over all the devices in the array and
> there is no 'hot-drive' for metadata access.
>
> 5/ The best way to write to a raid5 array is by writing large
> contiguous stretches of data. This should allow the raid5 array to
> collect blocks into stripes and write a full stripe at a time,
> which will not require any pre-reading. However it is not easy to
> do this in Linux.
>
> The "Best" interface for efficiently writing data is to have an
> asynchronous write, and a synchronous flush.
> The "write" says "write this whenever you like", and the "flush"
> says "Ok, I want this written NOW".
> Actually, it is possibly even better to have three stages:
> 1/ this is ready to be written
> 2/ write this now please
> 3/ don't return until this is written.
>
> The Unix system call interface has "write" with a very broad
> "flush" - you can flush a whole file, but I don't think that you
> can flush individual byte ranges (I could be wrong). It does
> fairly well.
>
> The NFS network filesystem, in version two, only has synchronous
> writes. This makes writing a real bottleneck. In version three,
> asynchronous writes were introduced together with a "COMMIT" which
> would flush a given byte range. This makes writing of large files
> much more efficient.
>
> The Linux block device layer only has one flavour of write
> request. It is not exactly a synchronous write, as the caller can
> choose to wait or not. But the device driver has (almost) no way
> of knowing whether the caller is waiting or not.
>
> This makes it hard to collect blocks together into a stripe. When
> raid5 gets a write request, it really needs to start acting on it
> straight away, which means scheduling a read of the parity block
> and the old data. While it is waiting for the read to complete, it
> may get some more writes attached to the stripe, and may even get a
> full stripe, but it cannot continue until the reads complete, and
> the reads may well have been wasted time. For larger numbers of
> drives, there could be less wastage, but there is still wastage.
>
> I included an (almost) above. This is because there is a fairly
> coarse method for drives to discover that a writer (or reader) is
> now waiting for a response. This is called "plugging".
> A device may choose to "plug" itself when it gets an I/O request.
> This causes the request to be queued, but the queue doesn't get
> processed. Thus subsequent requests can be merged on the queue.
> When the device gets unplugged, this smaller number of merged
> requests gets dealt with more efficiently.
>
> However, there is only one "unplug-all-devices"(*) call in the API
> that a reader or writer can make. It is not possible to unplug a
> particular device, or better still, to unplug a particular request.
>
> This works fairly well when doing IO on a single device - say an
> IDE drive, but when doing I/O on a raid array, which involves a
> number of devices, there will be a lot of unplug-all-devices calls,
> and plugging cannot be as effective.
>
> I have some patches which add plugging for raid5 writes, and it
> DRAMATICALLY improves sequential write throughput on a 3 to 5 drive
> raid5 array with a 4k chunk size. With other configurations there
> is an improvement, but it is not so dramatic. There are various
> reasons for this, but I believe that part of the reason is the
> extra noise of unplug-all-device calls. I haven't explored this
> very thoroughly yet.
>
> So, in short, you can do better than the current code (see my
> patches) but the Linux block-device API gets in the way a bit.
>
> In 2.2, a different approach was possible. As all filesystem data
> was in the buffer cache, which was physically addressed, the raid5
> code could, when preparing to write, look in the buffer cache for
> other blocks in the same stripe which were marked dirty, and
> proactively write them (even though no write request had been
> issued). This improved performance substantially for 2.2 raid5
> writing. However it is not possible in 2.4 because filesystem
> data is, by and large, not in the buffer cache - it is in the page
> cache.
>
> (*) The unplug-all-devices call is spelt:
> run_task_queue(&tq_disk);
>
> 6/ With reference to Hans' question in a follow-up:
>
> Is the following statement correct? Unless we write the whole stripe
> next to, rather than over, the current data, we cannot guarantee
> recoverability upon removal of a disk drive while the FS is in
> operation, and this is likely to be much of the motivation for the
> NetApp WAFL design as they gather writes into stripes (I think this
> last is true, but not sure).
>
> RAID5 cannot survive an unclean shutdown with a failed drive. This
> is because a stripe may have been partially written at the point of
> unclean shutdown, so reconstructing the missing block from the
> remaining drives will likely produce garbage.
> However, apart from that, there are no problems with losing drives
> while the FS is in operation.
>
> There are (at least) two effective responses to this problem:
>
> 1/ use NVRAM somewhere in the system so that you can effectively do
> a two stage commit - commit data to NVRAM, then write that
> data to the array, and then release the data from NVRAM.
> After an unclean shutdown, you re-write all data in NVRAM, and
> you are safe.
>
> Of course, the NVRAM could be replaced by any logging device, such
> as a separate mirrored pair of drives, but there could be a
> performance cost in that.
>
> This is a part of the NETAPP solution I believe.
>
> 2/ Use a filesystem that
> - knows about the raid stripe size, and
> - only ever writes full stripes, and
We should work with you to get reiserfs to be able to do the above, yes?
> - does so to stripes which didn't previously contain live
> data, and
> - knows which stripes it has written recently (even after an
> unclean shutdown) and
> - can tell if a stripe was written correctly or not.
These items require writing wandering logs, which are not likely to
happen before June, and probably come at a performance penalty, but we
should do them also.
Hans