On 2018/02/23 09:12, Holger Hoffstätte wrote:
> On 02/22/18 05:52, Qu Wenruo wrote:
>> btrfs_read_block_groups() is used to build up the block group cache
>> for all block groups, so it iterates over all block group items in
>> the extent tree.
>>
>> For a large filesystem (TB level), it searches for BLOCK_GROUP_ITEMs
>> thousands of times, which is the most time-consuming part of mounting
>> btrfs.
>>
>> So this patch tries to speed it up by:
>>
>> 1) Avoiding unnecessary readahead
>>    We were using READA_FORWARD to search for block group items.
>>    However, block group items are in fact scattered across a large
>>    number of leaves, so readahead just wastes IO (especially
>>    important for HDDs).
>>
>>    In a real-world case, a filesystem with 3T of used space has
>>    about 50K extent tree leaves but only 3K block group items,
>>    meaning we need to iterate over 16 leaves to find one block group
>>    item on average.
>>
>>    So readahead doesn't help; it only wastes slow HDD seeks.
>>
>> 2) Using chunk mapping to locate block group items
>>    Since each block group item always has a corresponding chunk
>>    item, we can use the chunk mapping to get the block group item
>>    size.
>>
>>    With the block group item size, we can do a pinpoint tree search
>>    instead of searching with an uncertain value and doing a forward
>>    search.
>>
>>    In some cases, e.g. when the next BLOCK_GROUP_ITEM is in the leaf
>>    following the current path, we can save that unnecessary tree
>>    block read.
>>
>> Cc: Ellis H. Wilson III <ell...@panasas.com>
>> Signed-off-by: Qu Wenruo <w...@suse.com>
>> ---
>> Since all my TB-level storage is occupied by my NAS, any feedback
>> (especially on real-world mount speed changes) is welcome.
>
> (sorry for the previous mail without results... finger salad)
>
> Decided to give this a try & got some nice results!
> Probably not on the same scale & nonlinear behaviour as Ellis will
> provide, since I just don't have that much storage, but interesting
> nevertheless.
>
> $ btrfs filesystem df /mnt/backup
> Data, single: total=1.10TiB, used=1.09TiB
> System, DUP: total=32.00MiB, used=144.00KiB
> Metadata, DUP: total=4.00GiB, used=2.23GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> $ btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l
> 1137
>
> current kernel (4.14++ with most of blk-mq+BFQ from 4.16):
> mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.211 total
> mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.122 total
> mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.236 total
>
> patched:
> mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.070 total
> mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.056 total
> mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.058 total
>
> That's not overwhelming, but still measurable and nice to have!
Looks pretty good, and pretty close to my prediction of about a 15%
improvement.

> While I was at it I decided to fill up the drive to almost-max
> ~3.7TB and see how much slower it would get... you won't believe
> what happened next. :-)
>
> $ btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l
> 3719
>
> mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.328 total
> mount /mnt/backup  0.00s user 0.03s system 2% cpu 1.361 total
> mount /mnt/backup  0.00s user 0.03s system 2% cpu 1.368 total
>
> Over three times the data, almost the same mount time as before?
> Yes please!

This is indeed beyond my expectation. But after some thought, it
depends on how you fill the fs.

If filled using fallocate or plain dd, most of the space will be data
using the maximum extent size (128M), which makes the new data block
groups quite compact: just 8+1 items for each block group. So a leaf
can easily contain the items for around 100 block groups.

This makes iterating over the new data block groups quite fast, as all
the related tree blocks will already be in cache (which also explains
why CPU usage increased).

That is to say, if I didn't miss anything, the old kernel should also
be able to mount in about 2 seconds. This is in fact the worst case for
the patch, as readahead would normally benefit such a layout. But since
the mount time is pretty much the same as on the old kernel, I would
call this a win!

> Overall this looks like a really nice improvement. Glad to see
> that my suspicion about the (non)usefulness of the readahead turned
> out to be true. :)

Yep, the readahead part is what's new in this speedup patch (there was
a similar patch years ago, but without disabling readahead), and I'm
pretty glad to see the result.
(Although it's a little sad to learn that readahead is of so little
use here.)

Thanks for your test,
Qu

>
> cheers,
> Holger