Hi Omar, Chris,

I have reviewed the free-space-tree code. It is a very nice feature.

However, I have a basic understanding question.

Let's say we are running a delayed ref which inserts a new EXTENT_ITEM
into the extent tree, e.g., we are in alloc_reserved_file_extent. At
this point we call remove_from_free_space_tree(), which updates the
free-space-tree about the allocated space. But this requires to COW
the free-space-tree itself. So we allocate a new tree block for the
free-space tree, and insert a new delayed ref, which will update the
extent tree about the new tree block allocation. We also insert a
delayed ref to free the previous copy of the free-space-tree block.

At some point we run these new delayed refs, so we insert/remove
EXTENT_ITEMs from the extent tree, and this in turn requires us to
update the free-space-tree again. So we need again to COW
free-space-tree blocks, generating more delayed refs.

At which point this recursion stops?

Do we assume that at some point all needed free-space tree blocks have
been COW'ed already, and we do not COW a tree block more than once per
transaction (unless it was written to disk due to memory pressure)?

Thanks!
Alex.


On Tue, Dec 29, 2015 at 11:19 PM, Chris Mason <c...@fb.com> wrote:
> On Tue, Sep 29, 2015 at 08:50:35PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval <osan...@fb.com>
>>
>> The free space cache has turned out to be a scalability bottleneck on
>> large, busy filesystems. When the cache for a lot of block groups needs
>> to be written out, we can get extremely long commit times; if this
>> happens in the critical section, things are especially bad because we
>> block new transactions from happening.
>>
>> The main problem with the free space cache is that it has to be written
>> out in its entirety and is managed in an ad hoc fashion. Using a B-tree
>> to store free space fixes this: updates can be done as needed and we get
>> all of the benefits of using a B-tree: checksumming, RAID handling,
>> well-understood behavior.
>>
>> With the free space tree, we get commit times that are about the same as
>> the no cache case with load times slower than the free space cache case
>> but still much faster than the no cache case. Free space is represented
>> with extents until it becomes more space-efficient to use bitmaps,
>> giving us similar space overhead to the free space cache.
>>
>> The operations on the free space tree are: adding and removing free
>> space, handling the creation and deletion of block groups, and loading
>> the free space for a block group. We can also create the free space tree
>> by walking the extent tree and clear the free space tree.
>>
>> Signed-off-by: Omar Sandoval <osan...@fb.com>
>
>> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
>> +{
>> +     struct btrfs_trans_handle *trans;
>> +     struct btrfs_root *tree_root = fs_info->tree_root;
>> +     struct btrfs_root *free_space_root;
>> +     struct btrfs_block_group_cache *block_group;
>> +     struct rb_node *node;
>> +     int ret;
>> +
>> +     trans = btrfs_start_transaction(tree_root, 0);
>> +     if (IS_ERR(trans))
>> +             return PTR_ERR(trans);
>> +
>> +     free_space_root = btrfs_create_tree(trans, fs_info,
>> +                                         BTRFS_FREE_SPACE_TREE_OBJECTID);
>> +     if (IS_ERR(free_space_root)) {
>> +             ret = PTR_ERR(free_space_root);
>> +             goto abort;
>> +     }
>> +     fs_info->free_space_root = free_space_root;
>> +
>> +     node = rb_first(&fs_info->block_group_cache_tree);
>> +     while (node) {
>> +             block_group = rb_entry(node, struct btrfs_block_group_cache,
>> +                                    cache_node);
>> +             ret = populate_free_space_tree(trans, fs_info, block_group);
>> +             if (ret)
>> +                     goto abort;
>> +             node = rb_next(node);
>> +     }
>> +
>> +     btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
>> +
>> +     ret = btrfs_commit_transaction(trans, tree_root);
>> +     if (ret)
>> +             return ret;
>> +
>> +     return 0;
>> +
>> +abort:
>> +     btrfs_abort_transaction(trans, tree_root, ret);
>> +     btrfs_end_transaction(trans, tree_root);
>> +     return ret;
>> +}
>> +
>
> Hi Omar,
>
> The only problem I've hit testing this stuff is where we create the tree
> on existing filesystems.  There are a few different problems here:
>
> 1) The populate code happens after resuming balance operations.  The
> balancing code could be changing these block groups while we scan them.
> I fixed this by moving the scan up earlier.
>
> 2) Delayed references may be run, which will also change the extent tree
> as we're scanning it.
>
> 3) We might need to commit the transaction to reclaim space.
>
> For now I'm ignoring #3 and adding a flag in fs_info that will make us
> skip delayed references.  This really isn't a good long term solution,
> we need to be able to do this on a per block group basis and make
> forward progress without pinning the delayed refs in ram.
>
> But for now, it'll do to get this into the tree for more testing.
>
> -chris
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to