On 09/04/2021 13:37, Filipe Manana wrote:
> On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn
> <[email protected]> wrote:
>>
>> When relocating a block group the freed up space is not discarded. On
>> devices like SSDs this hint is useful to tell the device the space is
>> freed now. On zoned block devices btrfs' discard code will reset the zone
>> the block group is on, freeing up the occupied space.
>>
>> Signed-off-by: Johannes Thumshirn <[email protected]>
>> ---
>> fs/btrfs/volumes.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 6d9b2369f17a..d9ef8bce0cde 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info
>> *fs_info, u64 chunk_offset)
>> struct btrfs_root *root = fs_info->chunk_root;
>> struct btrfs_trans_handle *trans;
>> struct btrfs_block_group *block_group;
>> + const bool trim = btrfs_is_zoned(fs_info) ||
>> + btrfs_test_opt(fs_info, DISCARD_SYNC);
>> + u64 trimmed;
>> + u64 length;
>> int ret;
>>
>> /*
>> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info
>> *fs_info, u64 chunk_offset)
>> if (!block_group)
>> return -ENOENT;
>> btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
>> + length = block_group->length;
>> btrfs_put_block_group(block_group);
>>
>> trans = btrfs_start_trans_remove_block_group(root->fs_info,
>> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info
>> *fs_info, u64 chunk_offset)
>> * step two, delete the device extents and the
>> * chunk tree entries
>> */
>> + if (trim) {
>> + ret = btrfs_discard_extent(fs_info, chunk_offset, length,
>> + &trimmed);
>
> Ideally we do IO and potentially slow operations such as discard while
> not holding a transaction handle open, to avoid slowing down anyone
> trying to commit the transaction.
>
>> + if (ret) {
>> + btrfs_abort_transaction(trans, ret);
>
> This is leaking the transaction, still need to call btrfs_end_transaction().
>
> Making the discard before joining/starting transaction, would fix both things.
Note taken.
>
> Now more importantly, I don't see why the freed space isn't already
> discarded at this point.
> Relocation creates delayed references to drop extents from the block
> group, and when the delayed references are run, we pin the extents
> through:
>
> __btrfs_free_extent() -> btrfs_update_block_group()
>
> Then at the very end of the transaction commit, we discard pinned
> extents and then unpin them, at btrfs_finish_extent_commit().
> Relocation commits the transaction at the end of
> relocate_block_group(), so the delayed references are fun, and then we
> discard and unpin their extents.
> So that's why I don't get it why the discard is needed here (at least
> when we are not on a zoned filesystem).
This is a good point to investigate, but as of now, the zone isn't reset.
Zone reset handling in btrfs is piggy backed onto discard handling, but
from testing the patchset I see the zone isn't reset:
[ 81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824
[ 81.015732] BTRFS info (device nullb0): relocating block group 1073741824
flags data
[ 81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move
data extents
[ 82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update
data pointers
# blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0
start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x0799a0 reset:0
non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
Whereas when the this patch is applied as well:
[ 85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824
[ 85.128211] BTRFS info (device nullb0): relocating block group 1073741824
flags data
[ 86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move
data extents
[ 86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update
data pointers
# blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0
start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0
non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
As a positive side effect of this, I now have code that can be used in
xfstests to verify the patchset.
Byte,
Johannes