On Mon, Apr 12, 2021 at 2:49 PM Johannes Thumshirn <johannes.thumsh...@wdc.com> wrote: > > On 09/04/2021 13:37, Filipe Manana wrote: > > On Fri, Apr 9, 2021 at 11:54 AM Johannes Thumshirn > > <johannes.thumsh...@wdc.com> wrote: > >> > >> When relocating a block group the freed up space is not discarded. On > >> devices like SSDs this hint is useful to tell the device the space is > >> freed now. On zoned block devices btrfs' discard code will reset the zone > >> the block group is on, freeing up the occupied space. > >> > >> Signed-off-by: Johannes Thumshirn <johannes.thumsh...@wdc.com> > >> --- > >> fs/btrfs/volumes.c | 13 +++++++++++++ > >> 1 file changed, 13 insertions(+) > >> > >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > >> index 6d9b2369f17a..d9ef8bce0cde 100644 > >> --- a/fs/btrfs/volumes.c > >> +++ b/fs/btrfs/volumes.c > >> @@ -3103,6 +3103,10 @@ static int btrfs_relocate_chunk(struct > >> btrfs_fs_info *fs_info, u64 chunk_offset) > >> struct btrfs_root *root = fs_info->chunk_root; > >> struct btrfs_trans_handle *trans; > >> struct btrfs_block_group *block_group; > >> + const bool trim = btrfs_is_zoned(fs_info) || > >> + btrfs_test_opt(fs_info, DISCARD_SYNC); > >> + u64 trimmed; > >> + u64 length; > >> int ret; > >> > >> /* > >> @@ -3130,6 +3134,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info > >> *fs_info, u64 chunk_offset) > >> if (!block_group) > >> return -ENOENT; > >> btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group); > >> + length = block_group->length; > >> btrfs_put_block_group(block_group); > >> > >> trans = btrfs_start_trans_remove_block_group(root->fs_info, > >> @@ -3144,6 +3149,14 @@ static int btrfs_relocate_chunk(struct > >> btrfs_fs_info *fs_info, u64 chunk_offset) > >> * step two, delete the device extents and the > >> * chunk tree entries > >> */ > >> + if (trim) { > >> + ret = btrfs_discard_extent(fs_info, chunk_offset, length, > >> + &trimmed); > > > > Ideally we do IO and potentially slow operations such as discard while > > not holding a transaction handle open, to avoid slowing down anyone > > trying to commit the transaction. > > > >> + if (ret) { > >> + btrfs_abort_transaction(trans, ret); > > > > This is leaking the transaction, still need to call btrfs_end_transaction(). > > > > Making the discard before joining/starting transaction, would fix both > > things. > > Note taken. > > > > > Now more importantly, I don't see why the freed space isn't already > > discarded at this point. > > Relocation creates delayed references to drop extents from the block > > group, and when the delayed references are run, we pin the extents > > through: > > > > __btrfs_free_extent() -> btrfs_update_block_group() > > > > Then at the very end of the transaction commit, we discard pinned > > extents and then unpin them, at btrfs_finish_extent_commit(). > > Relocation commits the transaction at the end of > > relocate_block_group(), so the delayed references are fun, and then we > > discard and unpin their extents. > > So that's why I don't get it why the discard is needed here (at least > > when we are not on a zoned filesystem). > > This is a good point to investigate, but as of now, the zone isn't reset. > Zone reset handling in btrfs is piggy backed onto discard handling, but > from testing the patchset I see the zone isn't reset: > > [ 81.014752] BTRFS info (device nullb0): reclaiming chunk 1073741824 > [ 81.015732] BTRFS info (device nullb0): relocating block group 1073741824 > flags data > [ 81.798090] BTRFS info (device nullb0): found 12452 extents, stage: move > data extents > [ 82.314562] BTRFS info (device nullb0): found 12452 extents, stage: update > data pointers > # blkzone report -o $((1073741824 >> 9)) -c 1 /dev/nullb0 > start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x0799a0 reset:0 > non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
Not familiar with zoned device details, but what you are passing to blkzone is a logical address, in non-zoned btrfs it does not always matches the physical address on disk. So maybe that is not a reliable way to check it. > > Whereas when the this patch is applied as well: > [ 85.126542] BTRFS info (device nullb0): reclaiming chunk 1073741824 > [ 85.128211] BTRFS info (device nullb0): relocating block group 1073741824 > flags data > [ 86.061831] BTRFS info (device nullb0): found 12452 extents, stage: move > data extents > [ 86.766549] BTRFS info (device nullb0): found 12452 extents, stage: update > data pointers > # blkzone report -c 1 -o $((1073741824 >> 9)) /dev/nullb0 > start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 > non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] > > As a positive side effect of this, I now have code that can be used in > xfstests to verify the patchset. Ok. Maybe the zone isn't reset properly because the default mechanism is doing smaller discards, on a per extent basis, and perhaps the order of those discards matters? If it affects only the zoned case, I also don't see why doing it when not in zoned mode (and -o discard=sync is set). Unless of course the default discard mechanism is broken somehow, in which case we need to find out why and fix it. Thanks. > > Byte, > Johannes -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”