On 09/04/2021 13:37, Filipe Manana wrote: > On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn > <johannes.thumsh...@wdc.com> wrote: >> >> When relocating a block group the freed up space is not discarded. On >> devices like SSDs this hint is useful to tell the device the space is >> freed now. On zoned block devices btrfs' discard code will reset the zone >> the block group is on, freeing up the occupied space. >> >> Signed-off-by: Johannes Thumshirn <johannes.thumsh...@wdc.com> >> --- >> fs/btrfs/volumes.c | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 6d9b2369f17a..d9ef8bce0cde 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info >> *fs_info, u64 chunk_offset) >> struct btrfs_root *root = fs_info->chunk_root; >> struct btrfs_trans_handle *trans; >> struct btrfs_block_group *block_group; >> + const bool trim = btrfs_is_zoned(fs_info) || >> + btrfs_test_opt(fs_info, DISCARD_SYNC); >> + u64 trimmed; >> + u64 length; >> int ret; >> >> /* >> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info >> *fs_info, u64 chunk_offset) >> if (!block_group) >> return -ENOENT; >> btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group); >> + length = block_group->length; >> btrfs_put_block_group(block_group); >> >> trans = btrfs_start_trans_remove_block_group(root->fs_info, >> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info >> *fs_info, u64 chunk_offset) >> * step two, delete the device extents and the >> * chunk tree entries >> */ >> + if (trim) { >> + ret = btrfs_discard_extent(fs_info, chunk_offset, length, >> + &trimmed); > > Ideally we do IO and potentially slow operations such as discard while > not holding a transaction handle open, to avoid slowing down anyone > trying to commit the transaction. > >> + if (ret) { >> + btrfs_abort_transaction(trans, ret); > > This is leaking the transaction, still need to call btrfs_end_transaction(). > > Making the discard before joining/starting transaction, would fix both things.
Note taken. > > Now more importantly, I don't see why the freed space isn't already > discarded at this point. > Relocation creates delayed references to drop extents from the block > group, and when the delayed references are run, we pin the extents > through: > > __btrfs_free_extent() -> btrfs_update_block_group() > > Then at the very end of the transaction commit, we discard pinned > extents and then unpin them, at btrfs_finish_extent_commit(). > Relocation commits the transaction at the end of > relocate_block_group(), so the delayed references are fun, and then we > discard and unpin their extents. > So that's why I don't get it why the discard is needed here (at least > when we are not on a zoned filesystem). This is a good point to investigate, but as of now, the zone isn't reset. Zone reset handling in btrfs is piggy backed onto discard handling, but from testing the patchset I see the zone isn't reset: [ 81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824 [ 81.015732] BTRFS info (device nullb0): relocating block group 1073741824 flags data [ 81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move data extents [ 82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers # blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0 start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x0799a0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)] Whereas when the this patch is applied as well: [ 85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824 [ 85.128211] BTRFS info (device nullb0): relocating block group 1073741824 flags data [ 86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move data extents [ 86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update data pointers # blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0 start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] As a positive side effect of this, I now have code that can be used in xfstests to verify the patchset. Byte, Johannes