Hi Omar, Chris, I have reviewed the free-space-tree code. It is a very nice feature.
However, I have a basic understanding question. Let's say we are running a delayed ref which inserts a new EXTENT_ITEM into the extent tree, e.g., we are in alloc_reserved_file_extent. At this point we call remove_from_free_space_tree(), which updates the free-space-tree about the allocated space. But this requires to COW the free-space-tree itself. So we allocate a new tree block for the free-space tree, and insert a new delayed ref, which will update the extent tree about the new tree block allocation. We also insert a delayed ref to free the previous copy of the free-space-tree block. At some point we run these new delayed refs, so we insert/remove EXTENT_ITEMs from the extent tree, and this in turn requires us to update the free-space-tree again. So we need again to COW free-space-tree blocks, generating more delayed refs. At which point this recursion stops? Do we assume that at some point all needed free-space tree blocks have been COW'ed already, and we do not COW a tree block more than once per transaction (unless it was written to disk due to memory pressure)? Thanks! Alex. On Tue, Dec 29, 2015 at 11:19 PM, Chris Mason <c...@fb.com> wrote: > On Tue, Sep 29, 2015 at 08:50:35PM -0700, Omar Sandoval wrote: >> From: Omar Sandoval <osan...@fb.com> >> >> The free space cache has turned out to be a scalability bottleneck on >> large, busy filesystems. When the cache for a lot of block groups needs >> to be written out, we can get extremely long commit times; if this >> happens in the critical section, things are especially bad because we >> block new transactions from happening. >> >> The main problem with the free space cache is that it has to be written >> out in its entirety and is managed in an ad hoc fashion. Using a B-tree >> to store free space fixes this: updates can be done as needed and we get >> all of the benefits of using a B-tree: checksumming, RAID handling, >> well-understood behavior. >> >> With the free space tree, we get commit times that are about the same as >> the no cache case with load times slower than the free space cache case >> but still much faster than the no cache case. Free space is represented >> with extents until it becomes more space-efficient to use bitmaps, >> giving us similar space overhead to the free space cache. >> >> The operations on the free space tree are: adding and removing free >> space, handling the creation and deletion of block groups, and loading >> the free space for a block group. We can also create the free space tree >> by walking the extent tree and clear the free space tree. >> >> Signed-off-by: Omar Sandoval <osan...@fb.com> > >> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info) >> +{ >> + struct btrfs_trans_handle *trans; >> + struct btrfs_root *tree_root = fs_info->tree_root; >> + struct btrfs_root *free_space_root; >> + struct btrfs_block_group_cache *block_group; >> + struct rb_node *node; >> + int ret; >> + >> + trans = btrfs_start_transaction(tree_root, 0); >> + if (IS_ERR(trans)) >> + return PTR_ERR(trans); >> + >> + free_space_root = btrfs_create_tree(trans, fs_info, >> + BTRFS_FREE_SPACE_TREE_OBJECTID); >> + if (IS_ERR(free_space_root)) { >> + ret = PTR_ERR(free_space_root); >> + goto abort; >> + } >> + fs_info->free_space_root = free_space_root; >> + >> + node = rb_first(&fs_info->block_group_cache_tree); >> + while (node) { >> + block_group = rb_entry(node, struct btrfs_block_group_cache, >> + cache_node); >> + ret = populate_free_space_tree(trans, fs_info, block_group); >> + if (ret) >> + goto abort; >> + node = rb_next(node); >> + } >> + >> + btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE); >> + >> + ret = btrfs_commit_transaction(trans, tree_root); >> + if (ret) >> + return ret; >> + >> + return 0; >> + >> +abort: >> + btrfs_abort_transaction(trans, tree_root, ret); >> + btrfs_end_transaction(trans, tree_root); >> + return ret; >> +} >> + > > Hi Omar, > > The only problem I've hit testing this stuff is where we create the tree > on existing filesystems. There are a few different problems here: > > 1) The populate code happens after resuming balance operations. The > balancing code could be changing these block groups while we scan them. > I fixed this by moving the scan up earlier. > > 2) Delayed references may be run, which will also change the extent tree > as we're scanning it. > > 3) We might need to commit the transaction to reclaim space. > > For now I'm ignoring #3 and adding a flag in fs_info that will make us > skip delayed references. This really isn't a good long term solution, > we need to be able to do this on a per block group basis and make > forward progress without pinning the delayed refs in ram. > > But for now, it'll do to get this into the tree for more testing. > > -chris > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html