> btrfs_read_block_groups() is used to build up the block group cache for
> all block groups, so it will iterate all block group items in extent
> tree.
> For large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
> thousands times, which is the most time consuming part of mounting
> btrfs.
> So this patch will try to speed it up by:
> 1) Avoid unnecessary readahead
>    We were using READA_FORWARD to search for block group item.
>    However block group items are in fact scattered across quite a lot of
>    leaves. Doing readahead will just waste our IO (especially important
>    for HDD).
>    In real world case, for a filesystem with 3T used space, it would
>    have about 50K extent tree leaves, but only have 3K block group
>    items. Meaning we need to iterate 16 leaves to meet one block group
>    on average.
>    So readahead won't help but waste slow HDD seeks.
> 2) Use chunk mapping to locate block group items
>    Since one block group item always has one corresponding chunk item,
>    we could use chunk mapping to get the block group item size.
>    With block group item size, we can do a pinpoint tree search, instead
>    of searching with some uncertain value and do forward search.
>    In some case, like next BLOCK_GROUP_ITEM is in the next leaf of
>    current path, we could save such unnecessary tree block read.
> Cc: Ellis H. Wilson III <ell...@panasas.com>
> Signed-off-by: Qu Wenruo <w...@suse.com>
> ---
> Since all my TB level storage is all occupied by my NAS, any feedback
> (especially for the real world mount speed change) is welcome.

(sorry for the previous mail without results..finger salad)

Decided to give this a try & got some nice results!
Probably not on the same scale, or with the nonlinear behaviour, that Ellis
will see, since I just don't have that much storage, but interesting anyway:

$btrfs filesystem df /mnt/backup 
Data, single: total=1.10TiB, used=1.09TiB
System, DUP: total=32.00MiB, used=144.00KiB
Metadata, DUP: total=4.00GiB, used=2.23GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l

current kernel (4.14++ with most of blk-mq+BFQ from 4.16):
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.211 total
mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.122 total
mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.236 total

with the patch applied:
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.070 total
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.056 total
mount /mnt/backup  0.00s user 0.02s system 1% cpu 1.058 total

That's not overwhelming, but still measurable and nice to have!
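
To put a number on "measurable", here is a quick sketch that averages the six
wall-clock times above. The names `before` and `after` are mine, and it assumes
the second group of three runs is the same kernel with the patch applied.

```python
from statistics import mean

# Wall-clock seconds copied from the mount runs above.
before = [1.211, 1.122, 1.236]  # current kernel
after = [1.070, 1.056, 1.058]   # assumed: same kernel plus the patch

mean_before = mean(before)
mean_after = mean(after)

# Relative reduction in mount time, as a fraction of the unpatched mean.
saving = (mean_before - mean_after) / mean_before

print(f"before: {mean_before:.3f}s  after: {mean_after:.3f}s  saving: {saving:.1%}")
```

With only three runs per kernel this is a rough indication, not a benchmark.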

While I was at it I decided to fill up the drive to almost-max
~3.7TB and see how much slower it would get...you won't believe
what happened next. :-)

$btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l

mount /mnt/backup  0.00s user 0.02s system 2% cpu 1.328 total
mount /mnt/backup  0.00s user 0.03s system 2% cpu 1.361 total
mount /mnt/backup  0.00s user 0.03s system 2% cpu 1.368 total

Over three times the data, almost the same mount time as before?
Yes please!
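
For the curious, a rough sketch of that comparison. It assumes "~3.7TB" is
decimal terabytes, takes the earlier 1.09TiB figure from the `btrfs filesystem
df` output, and assumes both sets of timings come from the patched kernel. The
variable names are made up for illustration.

```python
from statistics import mean

# Wall-clock seconds copied from the mount runs in this mail.
runs_small = [1.070, 1.056, 1.058]  # ~1.09TiB of data, assumed patched kernel
runs_full = [1.328, 1.361, 1.368]   # drive filled to ~3.7TB

# Convert 1.09 TiB to decimal TB so both sizes use the same unit.
used_small_tb = 1.09 * 2**40 / 1e12
used_full_tb = 3.7  # assumed to be decimal TB, as written above

data_ratio = used_full_tb / used_small_tb
time_ratio = mean(runs_full) / mean(runs_small)

print(f"data grew {data_ratio:.2f}x, mount time grew {time_ratio:.2f}x")
```

So mount time grows far more slowly than the amount of data, at least on this
one drive.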

Overall this looks like a really nice improvement. Glad to see
that my suspicion about the (non)usefulness of the readahead turned
out to be true. :)

