On 23/03/2021 11:22, Johannes Thumshirn wrote:
> On 23/03/2021 10:57, Filipe Manana wrote:
>> On Fri, Mar 19, 2021 at 10:52 AM Johannes Thumshirn
>> <johannes.thumsh...@wdc.com> wrote:
>>>
>>> When a file gets deleted on a zoned file system, the freed space is not
>>> returned to the block group's free space, but is migrated to
>>> zone_unusable.
>>>
>>> As this zone_unusable space is behind the current write pointer, it is
>>> not possible to use it for new allocations. In the current
>>> implementation a zone is only reset once all of the block group's space
>>> is accounted as zone unusable.
>>>
>>> This behaviour can lead to premature ENOSPC errors on a busy file system.
>>>
>>> Instead of only reclaiming the zone once it is completely unusable,
>>> kick off a reclaim job once the amount of unusable bytes exceeds a
>>> user-configurable threshold between 51% and 100%. It can be set per
>>> mounted filesystem via the sysfs tunable bg_reclaim_threshold, which
>>> defaults to 75%.
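A minimal sketch of the threshold check described above, assuming the
percentage lives in an fs_info->bg_reclaim_threshold field as the diffstat
suggests; the helper name bg_should_reclaim is hypothetical:

	static bool bg_should_reclaim(struct btrfs_block_group *bg)
	{
		struct btrfs_fs_info *fs_info = bg->fs_info;
		u64 threshold;

		/* bg_reclaim_threshold is the sysfs-tunable percentage (51-100) */
		threshold = div_u64(bg->length * fs_info->bg_reclaim_threshold, 100);

		/* Queue the block group for reclaim once zone_unusable crosses it */
		return bg->zone_unusable >= threshold;
	}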
>>>
>>> Similar to reclaiming unused block groups, these dirty block groups are
>>> added to a to_reclaim list and then, on transaction commit, the reclaim
>>> process is triggered, but only after unused block groups have been
>>> deleted, which frees up space for the relocation process.
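The queueing side could mirror the existing btrfs_mark_bg_unused(); a rough
sketch, assuming the list is protected by the existing unused_bgs_lock (the
function name follows that pattern and is not necessarily the patch's):

	void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg)
	{
		struct btrfs_fs_info *fs_info = bg->fs_info;

		spin_lock(&fs_info->unused_bgs_lock);
		if (list_empty(&bg->bg_list)) {
			/* Hold a reference while the block group sits on the list */
			btrfs_get_block_group(bg);
			list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
		}
		spin_unlock(&fs_info->unused_bgs_lock);
	}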
>>>
>>> Signed-off-by: Johannes Thumshirn <johannes.thumsh...@wdc.com>
>>> ---
>>>
>>> AFAICT sysfs_create_files() does not have the ability to provide an
>>> is_visible callback, so the bg_reclaim_threshold sysfs file is visible
>>> for non-zoned filesystems as well, even though we only add block groups
>>> to the reclaim list for zoned filesystems. I'm all ears for an approach
>>> that is sensible in this regard.
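For reference, conditional visibility is usually done by registering the
attribute through an attribute_group with an .is_visible callback instead
of sysfs_create_files(); a sketch, with the group and callback names being
assumptions rather than the patch's:

	static struct attribute *btrfs_zoned_attrs[] = {
		BTRFS_ATTR_PTR(, bg_reclaim_threshold),
		NULL,
	};

	static umode_t btrfs_zoned_attrs_visible(struct kobject *kobj,
						 struct attribute *attr, int n)
	{
		struct btrfs_fs_info *fs_info = to_fs_info(kobj);

		/* Returning 0 hides the file when the filesystem is not zoned */
		return btrfs_is_zoned(fs_info) ? attr->mode : 0;
	}

	static const struct attribute_group btrfs_zoned_attr_group = {
		.attrs = btrfs_zoned_attrs,
		.is_visible = btrfs_zoned_attrs_visible,
	};

The group would then be registered via sysfs_create_group() on the same
kobject instead of the individual files.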
>>>
>>>
>>>  fs/btrfs/block-group.c       | 84 ++++++++++++++++++++++++++++++++++++
>>>  fs/btrfs/block-group.h       |  2 +
>>>  fs/btrfs/ctree.h             |  3 ++
>>>  fs/btrfs/disk-io.c           | 11 +++++
>>>  fs/btrfs/free-space-cache.c  |  9 +++-
>>>  fs/btrfs/sysfs.c             | 35 +++++++++++++++
>>>  fs/btrfs/volumes.c           |  2 +-
>>>  fs/btrfs/volumes.h           |  1 +
>>>  include/trace/events/btrfs.h | 12 ++++++
>>>  9 files changed, 157 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>>> index 9ae3ac96a521..af9026795ddd 100644
>>> --- a/fs/btrfs/block-group.c
>>> +++ b/fs/btrfs/block-group.c
>>> @@ -1485,6 +1485,80 @@ void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
>>>         spin_unlock(&fs_info->unused_bgs_lock);
>>>  }
>>>
>>> +void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
>>> +{
>>> +       struct btrfs_block_group *bg;
>>> +       struct btrfs_space_info *space_info;
>>> +       int ret = 0;
>>> +
>>> +       if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
>>> +               return;
>>> +
>>> +       if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
>>> +               return;
>>> +
>>> +       mutex_lock(&fs_info->reclaim_bgs_lock);
>>> +       while (!list_empty(&fs_info->reclaim_bgs)) {
>>> +               bg = list_first_entry(&fs_info->reclaim_bgs,
>>> +                                     struct btrfs_block_group,
>>> +                                     bg_list);
>>> +               list_del_init(&bg->bg_list);
>>> +
>>> +               space_info = bg->space_info;
>>> +               mutex_unlock(&fs_info->reclaim_bgs_lock);
>>> +
>>> +               /* Don't want to race with allocators so take the groups_sem */
>>> +               down_write(&space_info->groups_sem);
>>> +
>>> +               spin_lock(&bg->lock);
>>> +               if (bg->reserved || bg->pinned || bg->ro) {
>>> +                       /*
>>> +                        * We want to bail if we made new allocations or have
>>> +                        * outstanding allocations in this block group.  We do
>>> +                        * the ro check in case balance is currently acting on
>>> +                        * this block group.
>>> +                        */
>>> +                       spin_unlock(&bg->lock);
>>> +                       up_write(&space_info->groups_sem);
>>> +                       goto next;
>>> +               }
>>> +               spin_unlock(&bg->lock);
>>> +
>>> +               ret = inc_block_group_ro(bg, 0);
>>> +               up_write(&space_info->groups_sem);
>>> +               if (ret < 0) {
>>> +                       ret = 0;
>>> +                       goto next;
>>> +               }
>>> +
>>> +               btrfs_info(fs_info, "reclaiming chunk %llu", bg->start);
>>> +               trace_btrfs_reclaim_block_group(bg);
>>> +               ret = btrfs_relocate_chunk(fs_info, bg->start);
>>
>> I think you forgot to test this with lockdep enabled, it should have
>> complained loudly otherwise.
>>
>> btrfs_relocate_chunk() calls lockdep_assert_held() to verify we are
>> holding fs_info->reclaim_bgs_lock, but we just unlocked it.
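One shape a v3 could take to satisfy that assertion is to keep
reclaim_bgs_lock held across the relocation instead of dropping it first;
a rough sketch under that assumption, not the actual fix:

	mutex_lock(&fs_info->reclaim_bgs_lock);
	while (!list_empty(&fs_info->reclaim_bgs)) {
		bg = list_first_entry(&fs_info->reclaim_bgs,
				      struct btrfs_block_group, bg_list);
		list_del_init(&bg->bg_list);

		/* ... reserved/pinned/ro checks as in the patch ... */

		/*
		 * btrfs_relocate_chunk() asserts reclaim_bgs_lock is held,
		 * so call it without unlocking.
		 */
		ret = btrfs_relocate_chunk(fs_info, bg->start);
		btrfs_put_block_group(bg);
	}
	mutex_unlock(&fs_info->reclaim_bgs_lock);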
> 
> I thought I did, but you're right. I'll need to send a v3 anyway addressing
> the review comments.

OK, as this was bugging me, I've found the "hole" in my test routine: I'm
hitting the internal lockdep "HLOCK_CHAINS too low" problem, which turns
lockdep validation off. It triggers in xfstests before a block group gets
dirty enough to start reclaim.
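If that is the chain-table exhaustion (the "MAX_LOCKDEP_CHAIN_HLOCKS too
low!" warning), one workaround, assuming a kernel that already carries the
lockdep capacity tunables merged around v5.12, is to rebuild with a larger
chain table:

	CONFIG_PROVE_LOCKING=y
	# Default is 16; MAX_LOCKDEP_CHAINS and the hlock table scale with it
	CONFIG_LOCKDEP_CHAINS_BITS=18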
