Re: Benchmarking btrfs on HW Raid ... BAD
On 09/28/2009 05:39 AM, Tobias Oetiker wrote:
> Hi Daniel,
>
> Today Daniel J Blueman wrote:
>
>> On Mon, Sep 28, 2009 at 9:17 AM, Florian Weimer <fwei...@bfk.de> wrote:
>>> * Tobias Oetiker:
>>>
>>>> Running this on a single disk, I get quite acceptable results. When
>>>> running on top of an Areca HW RAID6 (LVM-partitioned), both read and
>>>> write performance go down by at least two orders of magnitude.
>>>
>>> Does the HW RAID use write caching (preferably battery-backed)?
>>
>> I believe Areca controllers have an option for writeback or
>> writethrough caching, so it's worth checking this, and that you're
>> running the current firmware, in case of errata.
>>
>> Ironically, disabling writeback will give the OS tighter control of
>> request latency, but throughput may drop a lot.
>>
>> I still can't help thinking that this is down to the behaviour of the
>> controller, due to the 1-disk case working well.
>
> it certainly is down to a behaviour of the controller, or the results
> would be the same as with a single sata disk :-)
>
> It would be interesting to see what results others get on HW RAID
> controllers ...
>
>> One way would be to configure the array as 6 or 7 devices and allow
>> BTRFS/DM to manage the array, then see if performance under write load
>> is better, with and without writeback caching ...
>
> I can imagine that this would help, but since btrfs aims to be
> multi-purpose, it does not really help all that much, since it
> fundamentally alters the 'conditions': at the moment the RAID contains
> different filesystems and is partitioned using LVM ...
>
> cheers
> tobi
>
> the results for the ext3 fs look like this ...

I would be more suspicious of the barriers/flushes being issued. If your
write cache is non-volatile, we really do not want to send them down to
this type of device. Flushing this type of cache could certainly be
very, very expensive and slow.

Try mount -o nobarrier and see if your performance (write cache still
enabled on the controller) is back to expected levels,

Ric
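For completeness, the remount Ric suggests can also be issued from C via
the mount(2) syscall instead of the mount binary. A minimal sketch, under
the assumptions that the btrfs volume is already mounted at the
placeholder path /mnt/btrfs and the process runs as root:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* MS_REMOUNT changes the options of an existing mount; the
             * source ("none") and fstype (NULL) are ignored for a
             * remount, and the data string is handed to btrfs, so this
             * is equivalent to:
             *     mount -o remount,nobarrier /mnt/btrfs
             */
            if (mount("none", "/mnt/btrfs", NULL, MS_REMOUNT,
                      "nobarrier") != 0) {
                    perror("remount");
                    return 1;
            }
            return 0;
    }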
Re: Benchmarking btrfs on HW Raid ... BAD
Hi Ric,

Today Ric Wheeler wrote:

> I would be more suspicious of the barriers/flushes being issued. If
> your write cache is non-volatile, we really do not want to send them
> down to this type of device. Flushing this type of cache could
> certainly be very, very expensive and slow.
>
> Try mount -o nobarrier and see if your performance (write cache still
> enabled on the controller) is back to expected levels,

wow, indeed ... without special mount options I get the following from
my RAID6 with non-volatile cache:

## 1 readers (30s) -----------------------------------------------------------
A  read dir     cnt 78845   min 0.001 ms   max   29.713 ms   mean  0.027 ms   stdev   0.421
B  lstat file   cnt 73600   min 0.006 ms   max   21.639 ms   mean  0.038 ms   stdev   0.273
C  open file    cnt 57862   min 0.013 ms   max    0.100 ms   mean  0.017 ms   stdev   0.003
D  rd 1st byte  cnt 57861   min 0.014 ms   max   70.214 ms   mean  0.209 ms   stdev   0.919
E  read rate    185.464 MB/s (data)   63.842 MB/s (readdir + open + 1st byte + data)

3 readers (30s) --------------------------------------------------------------
A  read dir     cnt 41222   min 0.001 ms   max  169.195 ms   mean  0.056 ms   stdev   1.113
B  lstat file   cnt 38447   min 0.006 ms   max   79.977 ms   mean  0.064 ms   stdev   0.746
C  open file    cnt 30122   min 0.013 ms   max    0.042 ms   mean  0.018 ms   stdev   0.003
D  rd 1st byte  cnt 30122   min 0.014 ms   max  597.264 ms   mean  0.535 ms   stdev   6.646
E  read rate    124.144 MB/s (data)   31.197 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s) ---------------------------------------------------
F  write open   cnt   107   min 0.063 ms   max   70.593 ms   mean  0.760 ms   stdev   6.784
G  wr 1st byte  cnt   107   min 0.006 ms   max    0.014 ms   mean  0.007 ms   stdev   0.002
H  write close  cnt   107   min 0.017 ms   max 1784.192 ms   mean 20.830 ms   stdev 176.474
I  mkdir        cnt     9   min 0.049 ms   max    9.184 ms   mean  1.079 ms   stdev   2.865
J  write rate   0.200 MB/s (data)   0.199 MB/s (open + 1st byte + data)
A  read dir     cnt  1215   min 0.001 ms   max 2661.328 ms   mean  4.008 ms   stdev  81.513
B  lstat file   cnt  1144   min 0.007 ms   max  377.476 ms   mean  1.827 ms   stdev  18.844
C  open file    cnt   928   min 0.014 ms   max    1.596 ms   mean  0.021 ms   stdev   0.056
D  rd 1st byte  cnt   928   min 0.015 ms   max 1936.262 ms   mean 25.187 ms   stdev 123.755
E  read rate    9.199 MB/s (data)   0.792 MB/s (readdir + open + 1st byte + data)

mounting with -o nobarrier I get ...

## 1 readers (30s) -----------------------------------------------------------
A  read dir     cnt 78876   min 0.001 ms   max   19.803 ms   mean  0.013 ms   stdev   0.228
B  lstat file   cnt 73624   min 0.006 ms   max   18.032 ms   mean  0.034 ms   stdev   0.210
C  open file    cnt 57868   min 0.014 ms   max    0.041 ms   mean  0.017 ms   stdev   0.003
D  rd 1st byte  cnt 57869   min 0.019 ms   max  417.725 ms   mean  0.225 ms   stdev   2.459
E  read rate    177.779 MB/s (data)   63.375 MB/s (readdir + open + 1st byte + data)

3 readers (30s) --------------------------------------------------------------
A  read dir     cnt 38209   min 0.001 ms   max   26.745 ms   mean  0.025 ms   stdev   0.472
B  lstat file   cnt 35624   min 0.006 ms   max   26.019 ms   mean  0.048 ms   stdev   0.410
C  open file    cnt 27874   min 0.014 ms   max    1.257 ms   mean  0.017 ms   stdev   0.008
D  rd 1st byte  cnt 27874   min 0.020 ms   max 3197.520 ms   mean  0.626 ms   stdev  20.279
E  read rate    98.242 MB/s (data)   27.763 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s) ---------------------------------------------------
F  write open   cnt  5957   min 0.061 ms   max  591.787 ms   mean  0.457 ms   stdev   9.956
G  wr 1st byte  cnt  5956   min 0.006 ms   max    0.136 ms   mean  0.007 ms   stdev   0.002
H  write close  cnt  5957   min 0.017 ms   max 1340.145 ms   mean  0.818 ms   stdev  22.442
I  mkdir        cnt   574   min 0.034 ms   max   11.094 ms   mean  0.083 ms   stdev   0.543
J  write rate   9.766 MB/s (data)   8.705 MB/s (open + 1st byte + data)
A  read dir     cnt 15183   min 0.001 ms   max  439.260 ms   mean  0.130 ms   stdev   4.150
B  lstat file   cnt 14199   min 0.006 ms   max  200.212 ms   mean  0.152 ms   stdev   3.420
C  open file    cnt
[PATCH 2/2] btrfs: remove duplicates of filemap_ helpers
Use filemap_fdatawrite_range and filemap_fdatawait_range instead of
local copies of the functions. For filemap_fdatawait_range that also
means replacing the awkward old wait_on_page_writeback_range calling
convention with the regular filemap byte offsets.

Signed-off-by: Christoph Hellwig <h...@lst.de>

Index: linux-2.6/fs/btrfs/disk-io.c
===================================================================
--- linux-2.6.orig/fs/btrfs/disk-io.c	2009-09-30 13:55:25.396005824 -0300
+++ linux-2.6/fs/btrfs/disk-io.c	2009-09-30 13:57:49.917005980 -0300
@@ -822,16 +822,14 @@ struct extent_buffer *btrfs_find_create_
 
 int btrfs_write_tree_block(struct extent_buffer *buf)
 {
-	return btrfs_fdatawrite_range(buf->first_page->mapping, buf->start,
-				      buf->start + buf->len - 1, WB_SYNC_ALL);
+	return filemap_fdatawrite_range(buf->first_page->mapping, buf->start,
+					buf->start + buf->len - 1);
 }
 
 int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
 {
-	return btrfs_wait_on_page_writeback_range(buf->first_page->mapping,
-					buf->start >> PAGE_CACHE_SHIFT,
-					(buf->start + buf->len - 1) >>
-					PAGE_CACHE_SHIFT);
+	return filemap_fdatawait_range(buf->first_page->mapping,
+				       buf->start, buf->start + buf->len - 1);
 }
 
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
Index: linux-2.6/fs/btrfs/ordered-data.c
===================================================================
--- linux-2.6.orig/fs/btrfs/ordered-data.c	2009-09-30 13:44:52.424274060 -0300
+++ linux-2.6/fs/btrfs/ordered-data.c	2009-09-30 13:56:56.751254722 -0300
@@ -458,7 +458,7 @@ void btrfs_start_ordered_extent(struct i
 	 * start IO on any dirty ones so the wait doesn't stall waiting
 	 * for pdflush to find them
 	 */
-	btrfs_fdatawrite_range(inode->i_mapping, start, end, WB_SYNC_ALL);
+	filemap_fdatawrite_range(inode->i_mapping, start, end);
 	if (wait) {
 		wait_event(entry->wait, test_bit(BTRFS_ORDERED_COMPLETE,
						 &entry->flags));
@@ -488,17 +488,15 @@ again:
 	/* start IO across the range first to instantiate any delalloc
 	 * extents
 	 */
-	btrfs_fdatawrite_range(inode->i_mapping, start, orig_end, WB_SYNC_ALL);
+	filemap_fdatawrite_range(inode->i_mapping, start, orig_end);
 
 	/* The compression code will leave pages locked but return from
 	 * writepage without setting the page writeback. Starting again
 	 * with WB_SYNC_ALL will end up waiting for the IO to actually start.
 	 */
-	btrfs_fdatawrite_range(inode->i_mapping, start, orig_end, WB_SYNC_ALL);
+	filemap_fdatawrite_range(inode->i_mapping, start, orig_end);
 
-	btrfs_wait_on_page_writeback_range(inode->i_mapping,
-					   start >> PAGE_CACHE_SHIFT,
-					   orig_end >> PAGE_CACHE_SHIFT);
+	filemap_fdatawait_range(inode->i_mapping, start, orig_end);
 
 	end = orig_end;
 	found = 0;
@@ -716,89 +714,6 @@ out:
 }
 
-/**
- * taken from mm/filemap.c because it isn't exported
- *
- * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
- * @mapping:	address space structure to write
- * @start:	offset in bytes where the range starts
- * @end:	offset in bytes where the range ends (inclusive)
- * @sync_mode:	enable synchronous operation
- *
- * Start writeback against all of a mapping's dirty pages that lie
- * within the byte offsets <start, end> inclusive.
- *
- * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
- * opposed to a regular memory cleansing writeback. The difference between
- * these two operations is that if a dirty page/buffer is encountered, it must
- * be waited upon, and not just skipped over.
- */
-int btrfs_fdatawrite_range(struct address_space *mapping, loff_t start,
-			   loff_t end, int sync_mode)
-{
-	struct writeback_control wbc = {
-		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
-		.range_start = start,
-		.range_end = end,
-	};
-	return btrfs_writepages(mapping, &wbc);
-}
-
-/**
- * taken from mm/filemap.c because it isn't exported
- *
- * wait_on_page_writeback_range - wait for writeback to complete
- * @mapping:	target address_space
- * @start:	beginning page index
- * @end:	ending page index
- *
- * Wait for writeback to complete against pages indexed by start->end
- * inclusive
- */
-int btrfs_wait_on_page_writeback_range(struct address_space *mapping,
-				       pgoff_t start, pgoff_t end)
-{
-	struct pagevec pvec;
-	int nr_pages;
-	int ret = 0;
-	pgoff_t
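As a side note on the new calling convention: both generic helpers take
inclusive byte offsets, so the PAGE_CACHE_SHIFT arithmetic at the call
sites goes away entirely. A minimal sketch of the resulting
flush-then-wait pattern using only the exported filemap helpers (the
function name and variables are illustrative, not part of the patch):

    #include <linux/fs.h>

    /* Start writeback on the byte range [start, end] (inclusive) and
     * then wait for it to complete -- the pattern the patch converts
     * the btrfs call sites to.
     */
    static int flush_and_wait_range(struct address_space *mapping,
                                    loff_t start, loff_t end)
    {
            int ret;

            /* kick off writeback of dirty pages in the byte range */
            ret = filemap_fdatawrite_range(mapping, start, end);
            if (ret)
                    return ret;

            /* wait for the writeback just started to finish */
            return filemap_fdatawait_range(mapping, start, end);
    }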