On 2018-02-14 00:24, Holger Hoffstätte wrote:
> On 02/13/18 13:54, Qu Wenruo wrote:
>> On 2018-02-13 20:26, Holger Hoffstätte wrote:
>>> On 02/13/18 12:40, Qu Wenruo wrote:
>>>>>> The problem is not about how much space it takes, but how many extents
>>>>>> are here in the filesystem.
>>>
>>> I have no idea why btrfs' mount even needs to touch all block groups to
>>> get going (which seems to be the root of the problem), but here's a
>>> not so crazy idea for more "mechanical sympathy". Feel free to mock
>>> me if this is terribly wrong or not possible. ;)
>>>
>>> Mounting of even large filesystems (with many extents) seems to be fine
>>> on SSDs, but not so fine on rotational storage. We've heard that from
>>> several people with large (multi-TB) filesystems, and obviously it's
>>> even more terrible on 5400RPM drives because their seeks are sooo sloow.
>>>
>>> If the problem is that the bgs are touched/iterated in "tree order",
>>> would it then not be possible to sort the block groups in physical order
>>> before trying to load whatever mount needs to load?
>>
>> This is in fact a good idea.
>> Make block group into its own tree.
>
> Well, that's not what I was thinking about at all..yet. :)
> (keep in mind I'm not really that familiar with the internals).
>
> Out of curiosity I ran a bit of perf on my own mount process, which is
> fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and
> small files. Unfortunately it's also very fresh since I recreated it just
> this weekend, so everything is neatly packed together and fast.
>
> In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty
> old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s.
>
> My perf run shows that the only interesting part responsible for mount time
> is the nested loop in btrfs_read_block_groups calling find_first_block_group
> (which got inlined & is not in the perf callgraph) over and over again,
> accounting for 75% of time spent.
>
> I now understand your comment why the real solution to this problem
> is to move bgs into their own tree, and agree: both kitchens and databases
> have figured out a long time ago that the key to fast scan and lookup
> performance is to not put different things in the same storage container;
> in the case of analytical DBMS this is columnar storage. :)
>
> But what I originally meant was something much simpler and more
> brute-force-ish. I see that btrfs_read_block_groups adds readahead
> (is that actually effective?) but what I was looking for was the equivalent
> of a DBMS' sequential scan. Right now finding (and loading) a bg seems to
> involve a nested loop of tree lookups. It seems easier to rip through the
> entire tree in nice 8MB chunks and discard what you don't need instead
> of seeking around trying to find all the right bits in scattered order.

The problem is that the tree containing the block groups (the extent
tree) is very, very large.

It's a single tree shared by all subvolumes.
And since its nodes and leaves can be scattered across the whole disk,
it's pretty hard to do batch readahead.

>
> Could we alleviate cold mounts by starting more readaheads in
> btrfs_read_block_groups, so that the extent tree is scanned more linearly?

Since the extent tree is not laid out linearly on disk, it won't be as
effective as we'd hope.

Thanks,
Qu

>
> cheers,
> Holger
>
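
For illustration only, here is a toy sketch of the seek argument discussed in this thread. It assumes a made-up cost model in which seek cost is simply the distance between consecutive disk offsets, and it compares reading scattered tree blocks in key ("tree") order with reading the same blocks sorted by physical offset. The function names, block count, and offsets are all hypothetical; this is not how btrfs actually reads the extent tree.

```python
import random


def seek_distance(offsets):
    """Total head travel when blocks are read in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


def simulate(num_blocks=1000, disk_size=1_000_000, seed=0):
    """Return (tree_order_cost, physical_order_cost) for a random layout."""
    rng = random.Random(seed)
    # Assumption: the i-th block in key order sits at a random disk offset,
    # which mimics an old, fragmented metadata layout.
    tree_order = rng.sample(range(disk_size), num_blocks)
    physical_order = sorted(tree_order)
    return seek_distance(tree_order), seek_distance(physical_order)


if __name__ == "__main__":
    tree_cost, sorted_cost = simulate()
    print(f"tree order: {tree_cost}, physical order: {sorted_cost}")
```

In this model the physically sorted pass costs one sweep across the disk, while tree order bounces back and forth. The catch raised in the reply still applies: in a real tree, the offsets of child blocks are only known after their parent nodes have been read, so the full sorted list is not available up front.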