[PATCH][v2] btrfs: run delayed items before dropping the snapshot
From: Josef Bacik

With my delayed refs patches in place we started seeing a large amount of
aborts in __btrfs_free_extent:

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified
while it's dropping the snapshot.  It will unlock the root node (and really
every node) as it walks down the tree, only to re-lock it when it needs to
do something.  This is a problem because if we modify the tree we could cow
a block in our path, which frees our reference to that block.  Then once we
get back to that shared block we'll free our reference to it again, and get
ENOENT when trying to look up our extent reference to that block in
__btrfs_free_extent.

This is ultimately happening because we have delayed items left to be
processed for our deleted snapshot _after_ all of the inodes are closed for
the snapshot.  We only run the delayed inode item if we're deleting the
inode, and even then we do not run the delayed insertions or delayed
removals.  These can be run at any point after our final inode does its
last iput, which is what triggers the snapshot deletion.  We can end up
with the snapshot deletion happening and then have the delayed items run on
that file system, resulting in the above problem.
This problem has existed forever, however my patches made it much easier to
hit as I wake up the cleaner much more often to deal with delayed iputs,
which made us more likely to start the snapshot dropping work before the
transaction commits, which is when the delayed items would generally be
run.  Before, generally speaking, we would run the delayed items, commit
the transaction, and wake up the cleaner thread to start deleting
snapshots, which means we were less likely to hit this problem.  You could
still hit it if you had multiple snapshots to be deleted and ended up with
lots of delayed items, but it was definitely harder.

Fix for now by simply running all the delayed items before starting to drop
the snapshot.  We could make this smarter in the future by making the
delayed items per-root, and then simply dropping any delayed items for
roots that we are going to delete.  But for now just a quick and easy
solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik
---
v1->v2:
- check for errors from btrfs_run_delayed_items.
- Dave, I can reroll the series, but the second version of patch 1 is the
  same, let me know what you want.

 fs/btrfs/extent-tree.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..473084eb7a2d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 		goto out_free;
 	}
 
+	err = btrfs_run_delayed_items(trans);
+	if (err)
+		goto out_end_trans;
+
 	if (block_rsv)
 		trans->block_rsv = block_rsv;
-- 
2.14.3
Re: [RFC PATCH] btrfs: Remove __extent_readpages
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
> example) then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) so it is ready to be submitted directly to
> __do_contiguous_readpages. Also simplify the name of the function to
> contiguous_readpages.
> 
> Signed-off-by: Nikolay Borisov
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance
> measurements with fio I was shocked to see that it actually is detrimental to
> performance. Here are the results:
> 
> The command line used for fio:
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
> --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller
> than the memory, so pages should have been nicely readahead.
> 

What this patch changes is that now we aren't reading all of the pages we
are passed by the readahead, so we fall back to per-page reading when we go
to read those pages because the readahead window has already moved past
them.
This isn't great behavior.  What we have is nice in that it tries to group
the entire range together as much as possible.  What your patch changes is
that as soon as we stop having a contiguous range we just bail and submit
what we have.  Testing it in isolation is likely to show very little
change, but having recently touched the fault-in code where we definitely
do not count major faults in some cases, I'd suspect that we're not
reflecting this higher fault rate in the performance counters properly.  We
should preserve the existing behavior: what hurts a little bit on a lightly
loaded box is going to hurt a whole lot more on a heavily loaded box.
Thanks,

Josef
Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
On Tue, Dec 04, 2018 at 01:46:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput. There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue. This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik
> > ---
> >  fs/btrfs/ctree.h       |  4 +++-
> >  fs/btrfs/disk-io.c     |  5 ++---
> >  fs/btrfs/extent-tree.c | 13 -
> >  fs/btrfs/inode.c       | 21 +
> >  4 files changed, 34 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index dc56a4d940c3..20af5d6d81f1 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -915,7 +915,8 @@ struct btrfs_fs_info {
> >  
> >  	spinlock_t delayed_iput_lock;
> >  	struct list_head delayed_iputs;
> > -	struct mutex cleaner_delayed_iput_mutex;
> > +	atomic_t nr_delayed_iputs;
> > +	wait_queue_head_t delayed_iputs_wait;
> 
> Have you considered whether the same could be achieved with a completion
> rather than an open-coded waitqueue? I tried prototyping it and it could
> be done but it becomes messy regarding when the completion should be
> initialised, i.e. only when it's not in btrfs_add_delayed_iput.
> 

Yeah, a waitqueue makes more sense here than a completion since it's not a
one-off.

> > @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
> >  	 * bunch of pinned space, so make sure we run the iputs before
> >  	 * we do our pinned bytes check below.
> >  	 */
> > -	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> >  	btrfs_run_delayed_iputs(fs_info);
> > -	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +	btrfs_wait_on_delayed_iputs(fs_info);
> 
> Waiting on delayed iputs here is pointless since they are run
> synchronously from this context.
> 

Unless there are other threads (the cleaner thread) running iputs as well.
We could be running an iput that is evicting the inode in another thread
and we really want that space, so we need to wait here to make sure
everybody is truly done.  Thanks,

Josef
Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
On Tue, Dec 04, 2018 at 11:21:14AM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The cleaner thread usually takes care of delayed iputs, with the
> > exception of the btrfs_end_transaction_throttle path. The cleaner
> > thread only gets woken up every 30 seconds, so instead wake it up to do
> > its work so that we can free up that space as quickly as possible.
> 
> This description misses any rationale whatsoever about why the cleaner
> needs to be woken up more frequently than 30 seconds (and IMO this is
> the most important question that needs answering).
> 

Yeah, I'll add that.

> Also have you done any measurements of the number of processed delayed
> inodes with this change. Given the behavior you so desire why not just
> make delayed iputs be run via schedule_work on the global workqueue
> and be done with it? I'm sure the latency will be better than the
> current 30 seconds one :)

We already have the cleaner thread to do this work, and it sets up for the
snapshot drop stuff to be run as well.  We could probably add another
delayed work thing, but I would rather do that in a different patch.
Thanks,

Josef
[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the exception
of the btrfs_end_transaction_throttle path.  The cleaner thread only gets
woken up every 30 seconds, so instead wake it up to do its work so that we
can free up that space as quickly as possible.

Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h   | 3 +++
 fs/btrfs/disk-io.c | 3 +++
 fs/btrfs/inode.c   | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c8ddbacb6748..dc56a4d940c3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -769,6 +769,9 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
  */
 #define BTRFS_FS_BALANCE_RUNNING	18
 
+/* Indicate that the cleaner thread is awake and doing something. */
+#define BTRFS_FS_CLEANER_RUNNING	19
+
 struct btrfs_fs_info {
 	u8 fsid[BTRFS_FSID_SIZE];
 	u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5918ff8241b..f40f6fdc1019 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1669,6 +1669,8 @@ static int cleaner_kthread(void *arg)
 	while (1) {
 		again = 0;
 
+		set_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
+
 		/* Make the cleaner go to sleep early. */
 		if (btrfs_need_cleaner_sleep(fs_info))
 			goto sleep;
@@ -1715,6 +1717,7 @@ static int cleaner_kthread(void *arg)
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 sleep:
+		clear_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
 		if (kthread_should_park())
 			kthread_parkme();
 		if (kthread_should_stop())
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ac7abe2ae9b..0b9f3e482cea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3264,6 +3264,8 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	ASSERT(list_empty(&binode->delayed_iput));
 	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
+	if (!test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
+		wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3
[PATCH 1/3] btrfs: run delayed iputs before committing
Delayed iputs mean we can have final iputs of deleted inodes in the queue,
which could potentially generate a lot of pinned space that could be freed.
So before we decide to commit the transaction for ENOSPC reasons, run the
delayed iputs so that any potential space is freed up.  If there is any,
and we freed enough, we can then commit the transaction and potentially be
able to make our reservation.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/extent-tree.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8dfddfd3f315..0127d272cd2a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4953,6 +4953,15 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		ret = 0;
 		break;
 	case COMMIT_TRANS:
+		/*
+		 * If we have pending delayed iputs then we could free up a
+		 * bunch of pinned space, so make sure we run the iputs before
+		 * we do our pinned bytes check below.
+		 */
+		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+		btrfs_run_delayed_iputs(fs_info);
+		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
 		ret = may_commit_transaction(fs_info, space_info);
 		break;
 	default:
-- 
2.14.3
[PATCH 0/3][V2] Delayed iput fixes
v1->v2:
- only wake up if the cleaner isn't currently doing work.
- re-arranged some stuff for running delayed iputs during flushing.
- removed the open-coded wakeup in the waitqueue patch.

-- Original message --

Here are some delayed iput fixes.  Delayed iputs can hold reservations for
a while and there's no real good way to make sure they were gone for good,
which means we could early ENOSPC when in reality if we had just waited for
the iput we would have had plenty of space.  So fix this up by making us
wait for delayed iputs when deciding if we need to commit for enospc
flushing, and then clean up and rework how we run delayed iputs to make it
more straightforward to wait on them and make sure we're all done using
them.  Thanks,

Josef
[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we
could think we're done flushing iputs in the data space reservation path
when we could have a throttler doing an iput.  There's no real reason to
serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace
it with an atomic counter and a waitqueue.  This removes the short (or long
depending on how big the inode is) window where we think there are no more
pending iputs when there really are some.

Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h       |  4 +++-
 fs/btrfs/disk-io.c     |  5 ++---
 fs/btrfs/extent-tree.c | 13 -
 fs/btrfs/inode.c       | 21 +
 4 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dc56a4d940c3..20af5d6d81f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -915,7 +915,8 @@ struct btrfs_fs_info {
 
 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
-	struct mutex cleaner_delayed_iput_mutex;
+	atomic_t nr_delayed_iputs;
+	wait_queue_head_t delayed_iputs_wait;
 
 	/* this protects tree_mod_seq_list */
 	spinlock_t tree_mod_seq_lock;
@@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
 			      u64 start, u64 num_bytes, u64 min_size,
 			      loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f40f6fdc1019..238e0113f2d3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg)
 			goto sleep;
 		}
 
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
 		again = btrfs_clean_one_deleted_snapshot(root);
 		mutex_unlock(&fs_info->cleaner_mutex);
@@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
-	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->defrag_running, 0);
 	atomic_set(&fs_info->qgroup_op_seq, 0);
 	atomic_set(&fs_info->reada_works_cnt, 0);
+	atomic_set(&fs_info->nr_delayed_iputs, 0);
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 	fs_info->sb = sb;
 	fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb,
 	init_waitqueue_head(&fs_info->transaction_wait);
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
+	init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
 	INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0127d272cd2a..5b6c9fc227ff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 			/*
 			 * The cleaner kthread might still be doing iput
 			 * operations. Wait for it to finish so that
-			 * more space is released.
+			 * more space is released.  We don't need to
+			 * explicitly run the delayed iputs here because
+			 * the commit_transaction would have woken up
+			 * the cleaner.
 			 */
-			mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-			mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+			ret = btrfs_wait_on_delayed_iputs(fs_info);
+			if (ret)
+				return ret;
 			goto again;
 		} else {
 			btrfs_end_transaction(trans);
@@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		 * bunch of pinned space, so make sure we run the iputs before
 		 * we do our pinned bytes check below.
 		 */
-		mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
 		btrfs_run_delayed_iputs(fs_info);
-		mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+		btrfs_wait_on_delayed_iputs(fs_info);
[PATCH 8/8] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any
leftover space from our block rsv to make up for that fact.  So reserve
some extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik
---
 fs/btrfs/inode.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 623a71d871d4..8ac7abe2ae9b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
 	int failures = 0;
 
 	for (;;) {
 		struct btrfs_trans_handle *trans;
 		int ret;
 
-		ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
+		ret = btrfs_block_rsv_refill(root, rsv,
+					     rsv->size + delayed_refs_extra,
 					     BTRFS_RESERVE_FLUSH_LIMIT);
 
 		if (ret && ++failures > 2) {
@@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
 			return ERR_PTR(-ENOSPC);
 		}
 
+		/*
+		 * Evict can generate a large amount of delayed refs without
+		 * having a way to add space back since we exhaust our temporary
+		 * block rsv.  We aren't allowed to do FLUSH_ALL in this case
+		 * because we could deadlock with so many things in the flushing
+		 * code, so we have to try and hold some extra space to
+		 * compensate for our delayed ref generation.  If we can't get
+		 * that space then we need to see if we can steal our minimum
+		 * from the global reserve.  We will be ratelimited by the
+		 * amount of space we have for the delayed refs rsv, so we'll
+		 * end up committing and trying again.
+		 */
 		trans = btrfs_join_transaction(root);
-		if (IS_ERR(trans) || !ret)
+		if (IS_ERR(trans) || !ret) {
+			if (!IS_ERR(trans)) {
+				trans->block_rsv = &fs_info->trans_block_rsv;
+				trans->bytes_reserved = delayed_refs_extra;
+				btrfs_block_rsv_migrate(rsv, trans->block_rsv,
+							delayed_refs_extra, 1);
+			}
 			return trans;
+		}
 
 		/*
		 * Try to steal from the global reserve if there is space for
-- 
2.14.3
[PATCH 1/8] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However if we have pending free block groups in this transaction we
still want to commit, as we may be able to allocate a chunk to make our
reservation.  So instead of just returning ENOSPC, check if we have free
block groups pending, and if so commit the transaction to allow us to use
that free space.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
Reviewed-by: Nikolay Borisov
---
 fs/btrfs/extent-tree.c | 34 --
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07ef1b8087f7..755eb226d32d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4853,10 +4853,19 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	if (!bytes_needed)
 		return 0;
 
-	/* See if there is enough pinned space to make this reservation */
-	if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
-				     bytes_needed,
-				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+	trans = btrfs_join_transaction(fs_info->extent_root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	/*
+	 * See if there is enough pinned space to make this reservation, or if
+	 * we have bg's that are going to be freed, allowing us to possibly do a
+	 * chunk allocation the next loop through.
+	 */
+	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+	    __percpu_counter_compare(&fs_info->total_bytes_pinned,
+				     bytes_needed,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
 		goto commit;
 
 	/*
@@ -4864,7 +4873,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	 * this reservation.
 	 */
 	if (space_info != delayed_rsv->space_info)
-		return -ENOSPC;
+		goto enospc;
 
 	spin_lock(&delayed_rsv->lock);
 	reclaim_bytes += delayed_rsv->reserved;
@@ -4878,17 +4887,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	bytes_needed -= reclaim_bytes;
 
 	if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
-				     bytes_needed,
-				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-		return -ENOSPC;
-	}
-
+				     bytes_needed,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+		goto enospc;
 commit:
-	trans = btrfs_join_transaction(fs_info->extent_root);
-	if (IS_ERR(trans))
-		return -ENOSPC;
-
 	return btrfs_commit_transaction(trans);
+enospc:
+	btrfs_end_transaction(trans);
+	return -ENOSPC;
 }
 
 /*
-- 
2.14.3
[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for
metadata allocation chunks, we have a side-effect for mixed block group
fs'es where we are no longer allocating enough chunks for the data/metadata
requirements.  To deal with this add an ALLOC_CHUNK_FORCE step to the
flushing state machine.  This will only get used if we've already made a
full loop through the flushing machinery and tried committing the
transaction.  If we have, then we can try to force a chunk allocation since
we likely need it to make progress.  This resolves the issues I was seeing
with the mixed bg tests in xfstests with my previous patch.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h             |  3 ++-
 fs/btrfs/extent-tree.c       | 18 +-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 30da075c042e..7cf6ad021d81 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
 	FLUSH_DELALLOC		=	5,
 	FLUSH_DELALLOC_WAIT	=	6,
 	ALLOC_CHUNK		=	7,
-	COMMIT_TRANS		=	8,
+	ALLOC_CHUNK_FORCE	=	8,
+	COMMIT_TRANS		=	9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 667b992d322d..2d0dd70570ca 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4938,6 +4938,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		btrfs_end_transaction(trans);
 		break;
 	case ALLOC_CHUNK:
+	case ALLOC_CHUNK_FORCE:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
 			break;
 		}
 		ret = do_chunk_alloc(trans,
 				     btrfs_metadata_alloc_profile(fs_info),
-				     CHUNK_ALLOC_NO_FORCE);
+				     (state == ALLOC_CHUNK) ?
+				     CHUNK_ALLOC_NO_FORCE :
+				     CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
@@ -5081,6 +5084,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 				commit_cycles--;
 		}
 
+		/*
+		 * We don't want to force a chunk allocation until we've tried
+		 * pretty hard to reclaim space.  Think of the case where we
+		 * freed up a bunch of space and so have a lot of pinned space
+		 * to reclaim.  We would rather use that than possibly create
+		 * an underutilized metadata chunk.  So if this is our first
+		 * run through the flushing state machine skip
+		 * ALLOC_CHUNK_FORCE and commit the transaction.  If nothing
+		 * has changed the next go around then we can force a chunk
+		 * allocation.
+		 */
+		if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+			flush_state++;
+
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
 		{ FLUSH_DELAYED_REFS,		"FLUSH_DELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
+		{ ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE"},		\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3
[PATCH 0/8][V2] Enospc cleanups and fixes
v1->v2:
- addressed comments from reviewers.
- fixed a bug in patch 6 that was introduced because of changes to
  upstream.

-- Original message --

The delayed refs rsv patches exposed a bunch of issues in our enospc
infrastructure that needed to be addressed.  These aren't really one
coherent group, but they are all around flushing and reservations.
may_commit_transaction() needed to be updated a little bit, and we needed
to add a new state to force chunk allocation if things got dicey.  Also
because we can end up needing to reserve a whole bunch of extra space for
outstanding delayed refs, we needed to add the ability to fail only the
ENOSPC tickets that were too big to satisfy, instead of failing all of the
tickets.  There's also a fix in here for one of the corner cases where we
didn't quite have enough space reserved for the delayed refs we were
generating during evict().  Thanks,

Josef
[PATCH 2/8] btrfs: dump block_rsv when dumping space info
For enospc_debug, having the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
Reviewed-by: David Sterba
---
 fs/btrfs/extent-tree.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 755eb226d32d..204b35434056 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8063,6 +8063,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+#define DUMP_BLOCK_RSV(fs_info, rsv_name)				\
+do {									\
+	struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;		\
+	spin_lock(&__rsv->lock);					\
+	btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",	\
+		   __rsv->size, __rsv->reserved);			\
+	spin_unlock(&__rsv->lock);					\
+} while (0)
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
 			    struct btrfs_space_info *info, u64 bytes,
 			    int dump_block_groups)
@@ -8082,6 +8091,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info,
 			info->bytes_readonly);
 	spin_unlock(&info->lock);
 
+	DUMP_BLOCK_RSV(fs_info, global_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
+	DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
+
 	if (!dump_block_groups)
 		return;
-- 
2.14.3
[PATCH 6/8] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge
during writeout, which would cause us to need to make very large metadata
reservations.  However we may not actually need that much once writeout is
complete.  So instead try to make our reservation, and if we couldn't make
it, re-calculate our new reservation size and try again.  If our
reservation size doesn't change between tries then we know we are actually
out of space and can error out.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 58 +-
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ee77a98f867..0e1a499035ac 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
 	return ret;
 }
 
+static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
+				      u64 *metadata_bytes, u64 *qgroup_bytes)
+{
+	*metadata_bytes = 0;
+	*qgroup_bytes = 0;
+
+	spin_lock(&block_rsv->lock);
+	if (block_rsv->reserved < block_rsv->size)
+		*metadata_bytes = block_rsv->size - block_rsv->reserved;
+	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+		*qgroup_bytes = block_rsv->qgroup_rsv_size -
+			block_rsv->qgroup_rsv_reserved;
+	spin_unlock(&block_rsv->lock);
+}
+
 /**
  * btrfs_inode_rsv_refill - refill the inode block rsv.
  * @inode - the inode we are refilling.
@@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 num_bytes = 0;
+	u64 num_bytes = 0, last = 0;
 	u64 qgroup_num_bytes = 0;
 	int ret = -ENOSPC;
 
-	spin_lock(&block_rsv->lock);
-	if (block_rsv->reserved < block_rsv->size)
-		num_bytes = block_rsv->size - block_rsv->reserved;
-	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-		qgroup_num_bytes = block_rsv->qgroup_rsv_size -
-			block_rsv->qgroup_rsv_reserved;
-	spin_unlock(&block_rsv->lock);
-
+	__get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
 	if (num_bytes == 0)
 		return 0;
 
-	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
-	if (ret)
-		return ret;
-	ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+	do {
+		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
+		if (ret)
+			return ret;
+		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+		if (ret) {
+			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+			last = num_bytes;
+			/*
+			 * If we are fragmented we can end up with a lot of
+			 * outstanding extents which will make our size be much
+			 * larger than our reserved amount.  If we happen to
+			 * try to do a reservation here that may result in us
+			 * trying to do a pretty hefty reservation, which we may
+			 * not need once delalloc flushing happens.  If this is
+			 * the case try and do the reserve again.
+			 */
+			if (flush == BTRFS_RESERVE_FLUSH_ALL)
+				__get_refill_bytes(block_rsv, &num_bytes,
+						   &qgroup_num_bytes);
+			if (num_bytes == 0)
+				return 0;
+		}
+	} while (ret && last != num_bytes);
+
 	if (!ret) {
 		block_rsv_add_bytes(block_rsv, num_bytes, false);
 		trace_btrfs_space_reservation(root->fs_info, "delalloc",
@@ -5830,8 +5859,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
 		spin_lock(&block_rsv->lock);
 		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
 		spin_unlock(&block_rsv->lock);
-	} else
-		btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+	}
 	return ret;
 }
-- 
2.14.3
[PATCH 3/8] btrfs: don't use global rsv for chunk allocation
The should_alloc_chunk code has math in it to decide if we're getting short
on space and if we should go ahead and pre-emptively allocate a new chunk.
Previously when we did not have the delayed_refs_rsv, we had to assume that
the global block rsv was essentially used space and could be allocated
completely at any time, so we counted this space as "used" when determining
if we had enough slack space in our current space_info.

But on any slightly used file system (10gib or more) you can have a global
reserve of 512mib.  With our default chunk size being 1gib that means we
just assume half of the block group is used, which could result in us
allocating more metadata chunks than is actually required.

With the delayed refs rsv we can flush delayed refs as the space becomes
tight, and if we actually need more block groups then they will be
allocated based on space pressure.  We no longer require assuming the
global reserve is used space in our calculations.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 204b35434056..667b992d322d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4398,21 +4398,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 			      struct btrfs_space_info *sinfo, int force)
 {
-	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	u64 bytes_used = btrfs_space_info_used(sinfo, false);
 	u64 thresh;
 
 	if (force == CHUNK_ALLOC_FORCE)
 		return 1;
 
-	/*
-	 * We need to take into account the global rsv because for all intents
-	 * and purposes it's used space.  Don't worry about locking the
-	 * global_rsv, it doesn't change except when the transaction commits.
-	 */
-	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-		bytes_used += calc_global_rsv_need_space(global_rsv);
-
 	/*
 	 * in limited mode, we want to have some free space up to
 	 * about 1% of the FS size.
-- 
2.14.3
[PATCH 7/8] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when running delalloc because we may be holding a tree lock. We can also deadlock with delayed refs rsv's that are running via the committing mechanism. The only safe operations for FLUSH_LIMIT are running the delayed operations and allocating chunks; everything else has the potential to deadlock. Future-proof this by explicitly specifying the states that FLUSH_LIMIT is allowed to use. This will keep us from introducing bugs later on when adding new flush states. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0e1a499035ac..ab9d915d9289 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5123,12 +5123,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work) INIT_WORK(work, btrfs_async_reclaim_metadata_space); } +static const enum btrfs_flush_state priority_flush_states[] = { + FLUSH_DELAYED_ITEMS_NR, + FLUSH_DELAYED_ITEMS, + ALLOC_CHUNK, +}; + static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, struct reserve_ticket *ticket) { u64 to_reclaim; - int flush_state = FLUSH_DELAYED_ITEMS_NR; + int flush_state = 0; spin_lock(&space_info->lock); to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info, @@ -5140,7 +5146,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(&space_info->lock); do { - flush_space(fs_info, space_info, to_reclaim, flush_state); + flush_space(fs_info, space_info, to_reclaim, + priority_flush_states[flush_state]); flush_state++; spin_lock(&space_info->lock); if (ticket->bytes == 0) { @@ -5148,15 +5155,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(&space_info->lock); - - /* -* Priority flushers can't wait on delalloc without -* deadlocking. 
-*/ - if (flush_state == FLUSH_DELALLOC || - flush_state == FLUSH_DELALLOC_WAIT) - flush_state = ALLOC_CHUNK; - } while (flush_state < COMMIT_TRANS); + } while (flush_state < ARRAY_SIZE(priority_flush_states)); } static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, -- 2.14.3
[PATCH 5/8] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 45 + 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2d0dd70570ca..0ee77a98f867 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4801,6 +4801,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim, } struct reserve_ticket { + u64 orig_bytes; u64 bytes; int error; struct list_head list; @@ -5023,7 +5024,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, !test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state)); } -static void wake_all_tickets(struct list_head *head) +static bool wake_all_tickets(struct list_head *head) { struct reserve_ticket *ticket; @@ -5032,7 +5033,10 @@ static void wake_all_tickets(struct list_head *head) list_del_init(>list); ticket->error = -ENOSPC; wake_up(>wait); + if (ticket->bytes != ticket->orig_bytes) + return true; } + return false; } /* @@ -5100,8 +5104,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { - wake_all_tickets(_info->tickets); - space_info->flush = 0; + if (wake_all_tickets(_info->tickets)) { + flush_state = FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } 
else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5153,10 +5161,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(_info->lock); @@ -5177,14 +5186,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(>list)) list_del_init(>list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - update_bytes_may_use(space_info, -num_bytes); - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5210,6 +5217,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5245,6 +5253,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(); @@ -5285,25 +5294,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, sp
[PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
Now with the delayed_refs_rsv we can now know exactly how much pending delayed refs space we need. This means we can drastically simplify btrfs_check_space_for_delayed_refs by simply checking how much space we have reserved for the global rsv (which acts as a spill over buffer) and the delayed refs rsv. If our total size is beyond that amount then we know it's time to commit the transaction and stop any more delayed refs from being generated. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 2 +- fs/btrfs/extent-tree.c | 48 ++-- fs/btrfs/inode.c | 4 ++-- fs/btrfs/transaction.c | 2 +- 4 files changed, 22 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2eba398c722b..30da075c042e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info, } int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans); -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans); +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, const u64 start); void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5a2d0b061f57..07ef1b8087f7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes) return num_csums; } -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans) +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info) { - struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_block_rsv *global_rsv; - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready; - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums; - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs; - u64 
num_bytes, num_dirty_bgs_bytes; - int ret = 0; + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv; + struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; + bool ret = false; + u64 reserved; - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1); - num_heads = heads_to_leaves(fs_info, num_heads); - if (num_heads > 1) - num_bytes += (num_heads - 1) * fs_info->nodesize; - num_bytes <<= 1; - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) * - fs_info->nodesize; - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info, -num_dirty_bgs); - global_rsv = _info->global_block_rsv; + spin_lock(_rsv->lock); + reserved = global_rsv->reserved; + spin_unlock(_rsv->lock); /* -* If we can't allocate any more chunks lets make sure we have _lots_ of -* wiggle room since running delayed refs can create more delayed refs. +* Since the global reserve is just kind of magic we don't really want +* to rely on it to save our bacon, so if our size is more than the +* delayed_refs_rsv and the global rsv then it's time to think about +* bailing. 
*/ - if (global_rsv->space_info->full) { - num_dirty_bgs_bytes <<= 1; - num_bytes <<= 1; - } - - spin_lock(_rsv->lock); - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes) - ret = 1; - spin_unlock(_rsv->lock); + spin_lock(_refs_rsv->lock); + reserved += delayed_refs_rsv->reserved; + if (delayed_refs_rsv->size >= reserved) + ret = true; + spin_unlock(_refs_rsv->lock); return ret; } @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans) if (val >= NSEC_PER_SEC / 2) return 2; - return btrfs_check_space_for_delayed_refs(trans); + return btrfs_check_space_for_delayed_refs(trans->fs_info); } struct async_delayed_refs { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index a097f5fde31d..8532a2eb56d1 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, * Try to steal from the global reserve if there is space for * it. */ - if (!btrfs_check_space_for_delayed_refs(trans) && - !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, false)) +
[PATCH 10/10] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve simply use that instead of all of this magic. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 79 1 file changed, 17 insertions(+), 62 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8532a2eb56d1..623a71d871d4 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) return err; } -static int truncate_space_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytes_deleted) -{ - struct btrfs_fs_info *fs_info = root->fs_info; - int ret; - - /* -* This is only used to apply pressure to the enospc system, we don't -* intend to use this reservation at all. -*/ - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted); - bytes_deleted *= fs_info->nodesize; - ret = btrfs_block_rsv_add(root, _info->trans_block_rsv, - bytes_deleted, BTRFS_RESERVE_NO_FLUSH); - if (!ret) { - trace_btrfs_space_reservation(fs_info, "transaction", - trans->transid, - bytes_deleted, 1); - trans->bytes_reserved += bytes_deleted; - } - return ret; - -} - /* * Return this if we need to call truncate_block for the last bit of the * truncate. 
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans)) should_throttle = true; } @@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +
[PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic
Over the years we have built up a lot of infrastructure to keep delayed refs in check, mostly by running them at btrfs_end_transaction() time. We have a lot of different maths we do to figure out how much, if we should do it inline or async, etc. This existed because we had no feedback mechanism to force the flushing of delayed refs when they became a problem. However with the enospc flushing infrastructure in place for flushing delayed refs when they put too much pressure on the enospc system we have this problem solved. Rip out all of this code as it is no longer needed. Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 38 -- 1 file changed, 38 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 2d8401bf8df9..01f39401619a 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -798,22 +798,12 @@ static int should_end_transaction(struct btrfs_trans_handle *trans) int btrfs_should_end_transaction(struct btrfs_trans_handle *trans) { struct btrfs_transaction *cur_trans = trans->transaction; - int updates; - int err; smp_mb(); if (cur_trans->state >= TRANS_STATE_BLOCKED || cur_trans->delayed_refs.flushing) return 1; - updates = trans->delayed_ref_updates; - trans->delayed_ref_updates = 0; - if (updates) { - err = btrfs_run_delayed_refs(trans, updates * 2); - if (err) /* Error code will also eval true */ - return err; - } - return should_end_transaction(trans); } @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, { struct btrfs_fs_info *info = trans->fs_info; struct btrfs_transaction *cur_trans = trans->transaction; - u64 transid = trans->transid; - unsigned long cur = trans->delayed_ref_updates; int lock = (trans->type != TRANS_JOIN_NOLOCK); int err = 0; - int must_run_delayed_refs = 0; if (refcount_read(>use_count) > 1) { refcount_dec(>use_count); @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans); 
trans->block_rsv = NULL; - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); - - trans->delayed_ref_updates = 0; - if (!trans->sync) { - must_run_delayed_refs = - btrfs_should_throttle_delayed_refs(trans); - cur = max_t(unsigned long, cur, 32); - - /* -* don't make the caller wait if they are from a NOLOCK -* or ATTACH transaction, it will deadlock with commit -*/ - if (must_run_delayed_refs == 1 && - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH))) - must_run_delayed_refs = 2; - } - - btrfs_trans_release_metadata(trans); - trans->block_rsv = NULL; - if (!list_empty(>new_bgs)) btrfs_create_pending_block_groups(trans); @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, } kmem_cache_free(btrfs_trans_handle_cachep, trans); - if (must_run_delayed_refs) { - btrfs_async_run_delayed_refs(info, cur, transid, -must_run_delayed_refs == 1); - } return err; } -- 2.14.3
[PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c36b3a42f2bb..e3ed3507018d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + >transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(>lock); spin_unlock(_refs->lock); - trace_run_delayed_ref_head(fs_info, head, 0); - - if (head->total_ref_mod < 0) { 
- struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 00/10][V2] Delayed refs rsv
v1->v2: - addressed the comments from the various reviewers. - split "introduce delayed_refs_rsv" into 5 patches. The patches are the same together as they were, just split out more logically. They can't really be bisected across (you will likely hit fun enospc failures in between), but they compile properly. This was done to make it easier for review. -- Original message -- This patchset changes how we do space reservations for delayed refs. We were hitting probably 20-40 enospc aborts per day in production while running delayed refs at transaction commit time. This means we ran out of space in the global reserve and couldn't easily get more space in use_block_rsv(). The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictable. Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more we can use the global reserve for that spill over, which fits better into its desired use than a full blown reservation. This first approach is to simply take how many times we're reserving space for and multiply that by 2 in order to save enough space for the delayed refs that could be generated. This is a naive approach and will probably evolve, but for now it works. With this patchset we've gone down to 2-8 failures per week. It's not perfect, there are some corner cases that still need to be addressed, but it is significantly better than what we had. Thanks, Josef
[PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv
Any space used in the delayed_refs_rsv will be freed up by a transaction commit, so instead of just counting the pinned space we also need to account for any space in the delayed_refs_rsv when deciding if it will make a difference to commit the transaction to satisfy our space reservation. If we have enough bytes to satisfy our reservation ticket then we are good to go, otherwise subtract out what space we would gain back by committing the transaction and compare that against the pinned space to make our decision. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 24 +++- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index aa0a638d0263..63ff9d832867 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, { struct reserve_ticket *ticket = NULL; struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv; + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv; struct btrfs_trans_handle *trans; - u64 bytes; + u64 bytes_needed; + u64 reclaim_bytes = 0; trans = (struct btrfs_trans_handle *)current->journal_info; if (trans) @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, else if (!list_empty(&space_info->tickets)) ticket = list_first_entry(&space_info->tickets, struct reserve_ticket, list); - bytes = (ticket) ? ticket->bytes : 0; + bytes_needed = (ticket) ? ticket->bytes : 0; spin_unlock(&space_info->lock); - if (!bytes) + if (!bytes_needed) return 0; /* See if there is enough pinned space to make this reservation */ if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes, + bytes_needed, BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, return -ENOSPC; spin_lock(&delayed_rsv->lock); - if (delayed_rsv->size > bytes) - bytes = 0; - else - bytes -= delayed_rsv->size; + reclaim_bytes += delayed_rsv->reserved; spin_unlock(&delayed_rsv->lock); + spin_lock(&delayed_refs_rsv->lock); + reclaim_bytes += delayed_refs_rsv->reserved; + spin_unlock(&delayed_refs_rsv->lock); + if (reclaim_bytes >= bytes_needed) + goto commit; + bytes_needed -= reclaim_bytes; + if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes, + bytes_needed, BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { return -ENOSPC; } -- 2.14.3
[PATCH 05/10] btrfs: introduce delayed_refs_rsv
From: Josef Bacik Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. 
Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 14 +++- fs/btrfs/delayed-ref.c | 43 -- fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 212 + fs/btrfs/transaction.c | 37 - 5 files changed, 284 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8b41ec42f405..52a87d446945 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -448,8 +448,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -812,6 +813,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2796,6 +2799,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 48725fa757a3..a198ab91879c 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -474,11 +474,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans, * existing and update must have the same bytenr */ static noinline void -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, +update_existing_head_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *existing, struct btrfs_delayed_ref_head *update, int *old_ref_mod_ret) { + struct btrfs_delayed_ref_root *delayed_refs = + >transaction->delayed_refs; + struct btrfs_fs_info *fs_info = trans->fs_info; int old_ref_mod; BUG_ON(existing->is_data != update->is_data); @@ -536,10 +539,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, * versa we need to make sure to adjust pending_csums accordingly. */ if (existing->is_data) { - if (existing->total_ref_mod >= 0 && old_ref_mod < 0) + u64 csum_leaves = + btrfs_csum_bytes_to_leaves(fs_info, + existing->num_bytes); + + if (existing->total_ref_mod >= 0 && old_ref_mod < 0) {
[PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks every time we need a new delayed ref head, so we always run at least one ref head completely no matter what the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index b3e4c9fcb664..48725fa757a3 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, ref->in_tree = 0; btrfs_put_delayed_ref(ref); atomic_dec(_refs->num_entries); - if (trans->delayed_ref_updates) - trans->delayed_ref_updates--; } static bool merge_ref(struct btrfs_trans_handle *trans, @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, if (ref->action == BTRFS_ADD_DELAYED_REF) list_add_tail(>add_list, >ref_add_list); atomic_inc(>num_entries); - trans->delayed_ref_updates++; spin_unlock(>lock); return ret; } -- 2.14.3
[PATCH 03/10] btrfs: cleanup extent_op handling
From: Josef Bacik The cleanup_extent_op function actually would run the extent_op if it needed running, which made the name sort of a misnomer. Change it to run_and_cleanup_extent_op, and move the actual cleanup work to cleanup_extent_op so it can be used by check_ref_cleanup() in order to unify the extent op handling. Reviewed-by: Lu Fengqi Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 35 ++- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e3ed3507018d..9f169f3c5fdb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2424,19 +2424,32 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref btrfs_delayed_ref_unlock(head); } -static int cleanup_extent_op(struct btrfs_trans_handle *trans, -struct btrfs_delayed_ref_head *head) +static struct btrfs_delayed_extent_op *cleanup_extent_op( + struct btrfs_delayed_ref_head *head) { struct btrfs_delayed_extent_op *extent_op = head->extent_op; - int ret; if (!extent_op) - return 0; - head->extent_op = NULL; + return NULL; + if (head->must_insert_reserved) { + head->extent_op = NULL; btrfs_free_delayed_extent_op(extent_op); - return 0; + return NULL; } + return extent_op; +} + +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, +struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_extent_op *extent_op; + int ret; + + extent_op = cleanup_extent_op(head); + if (!extent_op) + return 0; + head->extent_op = NULL; spin_unlock(>lock); ret = run_delayed_extent_op(trans, head, extent_op); btrfs_free_delayed_extent_op(extent_op); @@ -2488,7 +2501,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs = >transaction->delayed_refs; - ret = cleanup_extent_op(trans, head); + ret = run_and_cleanup_extent_op(trans, head); if (ret < 0) { unselect_delayed_ref_head(delayed_refs, head); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); @@ -6977,12 +6990,8 @@ 
static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!RB_EMPTY_ROOT(>ref_tree.rb_root)) goto out; - if (head->extent_op) { - if (!head->must_insert_reserved) - goto out; - btrfs_free_delayed_extent_op(head->extent_op); - head->extent_op = NULL; - } + if (cleanup_extent_op(head) != NULL) + goto out; /* * waiting for the lock here would deadlock. If someone else has it -- 2.14.3
[PATCH 01/10] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik We do this dance in cleanup_ref_head and check_ref_cleanup, unify it into a helper and clean up the calling functions. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/delayed-ref.c | 14 ++ fs/btrfs/delayed-ref.h | 3 ++- fs/btrfs/extent-tree.c | 22 +++--- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 9301b3ad9217..b3e4c9fcb664 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head( return head; } +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + lockdep_assert_held(&delayed_refs->lock); + lockdep_assert_held(&head->lock); + + rb_erase_cached(&head->href_node, &delayed_refs->href_root); + RB_CLEAR_NODE(&head->href_node); + atomic_dec(&delayed_refs->num_entries); + delayed_refs->num_heads--; + if (head->processing == 0) + delayed_refs->num_heads_ready--; +} + /* * Helper to insert the ref_node to the tail or merge with tail.
* diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 8e20c5cb5404..d2af974f68a1 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { mutex_unlock(&head->mutex); } - +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); struct btrfs_delayed_ref_head *btrfs_select_ref_head( struct btrfs_delayed_ref_root *delayed_refs); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d242a1174e50..c36b3a42f2bb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&delayed_refs->lock); return 1; } - delayed_refs->num_heads--; - rb_erase_cached(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - atomic_dec(&delayed_refs->num_entries); trace_run_delayed_ref_head(fs_info, head, 0); @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!mutex_trylock(&head->mutex)) goto out; - /* -* at this point we have a head with no other entries. Go -* ahead and process it. -*/ - rb_erase_cached(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); - atomic_dec(&delayed_refs->num_entries); - - /* -* we don't take a ref on the node because we're removing it from the -* tree, so we just steal the ref the tree was holding. -*/ - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; + btrfs_delete_ref_head(delayed_refs, head); head->processing = 0; + spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); -- 2.14.3
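The point of the helper is easiest to see in a stripped-down model: the rb-tree erase and the three counter updates must always happen together, and the num_heads_ready decrement depends on head->processing. A minimal userspace sketch (simplified stand-in structs, not the kernel types):

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures (illustrative only). */
struct ref_root {
	int num_entries;
	int num_heads;
	int num_heads_ready;
};

struct ref_head {
	int processing;	/* nonzero once a worker has selected this head */
};

/*
 * Model of btrfs_delete_ref_head(): every counter update lives in one
 * place, so the two call sites (cleanup_ref_head and check_ref_cleanup)
 * cannot drift apart, including the processing-dependent decrement.
 */
static void delete_ref_head(struct ref_root *root, struct ref_head *head)
{
	root->num_entries--;
	root->num_heads--;
	if (head->processing == 0)
		root->num_heads_ready--;
}
```

Before the patch, each call site open-coded these updates; the cleanup_ref_head copy handled `processing` implicitly (it only ran on selected heads), which is the kind of subtlety a shared helper makes explicit.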
[PATCH 07/10] btrfs: add new flushing states for the delayed refs rsv
A nice thing we gain with the delayed refs rsv is the ability to flush the delayed refs on demand to deal with enospc pressure. Add states to flush delayed refs on demand, and this will allow us to remove a lot of ad-hoc work around checking to see if we should commit the transaction to run our delayed refs. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 10 ++ fs/btrfs/extent-tree.c | 14 ++ include/trace/events/btrfs.h | 2 ++ 3 files changed, 22 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 52a87d446945..2eba398c722b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2745,10 +2745,12 @@ enum btrfs_reserve_flush_enum { enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELAYED_ITEMS = 2, - FLUSH_DELALLOC = 3, - FLUSH_DELALLOC_WAIT = 4, - ALLOC_CHUNK = 5, - COMMIT_TRANS = 6, + FLUSH_DELAYED_REFS_NR = 3, + FLUSH_DELAYED_REFS = 4, + FLUSH_DELALLOC = 5, + FLUSH_DELALLOC_WAIT = 6, + ALLOC_CHUNK = 7, + COMMIT_TRANS = 8, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 63ff9d832867..5a2d0b061f57 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4938,6 +4938,20 @@ static void flush_space(struct btrfs_fs_info *fs_info, shrink_delalloc(fs_info, num_bytes * 2, num_bytes, state == FLUSH_DELALLOC_WAIT); break; + case FLUSH_DELAYED_REFS_NR: + case FLUSH_DELAYED_REFS: + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + break; + } + if (state == FLUSH_DELAYED_REFS_NR) + nr = calc_reclaim_items_nr(fs_info, num_bytes); + else + nr = 0; + btrfs_run_delayed_refs(trans, nr); + btrfs_end_transaction(trans); + break; case ALLOC_CHUNK: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 8568946f491d..63d1f9d8b8c7 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1048,6
+1048,8 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_ITEMS, "FLUSH_DELAYED_ITEMS"}, \ { FLUSH_DELALLOC, "FLUSH_DELALLOC"}, \ { FLUSH_DELALLOC_WAIT, "FLUSH_DELALLOC_WAIT"}, \ + { FLUSH_DELAYED_REFS_NR, "FLUSH_DELAYED_REFS_NR"}, \ + { FLUSH_DELAYED_REFS, "FLUSH_DELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) -- 2.14.3
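The flushing loop these states feed behaves like an ordered escalation: ticket flushing walks the states in order and stops as soon as enough space is reclaimed, with the new delayed-ref states tried before delalloc and COMMIT_TRANS kept as the last resort. A rough userspace model (the reclaim callback and byte counts are made up for illustration):

```c
#include <assert.h>

/* Mirrors the reordered enum from the patch. */
enum flush_state {
	FLUSH_DELAYED_ITEMS_NR = 1,
	FLUSH_DELAYED_ITEMS = 2,
	FLUSH_DELAYED_REFS_NR = 3,
	FLUSH_DELAYED_REFS = 4,
	FLUSH_DELALLOC = 5,
	FLUSH_DELALLOC_WAIT = 6,
	ALLOC_CHUNK = 7,
	COMMIT_TRANS = 8,
};

/*
 * Illustrative model of the ticket flushing loop: walk the states in
 * order, stop as soon as enough bytes were reclaimed, and fall back to
 * a transaction commit only at the end.
 */
static enum flush_state flush_until(long needed,
				    long (*reclaim)(enum flush_state))
{
	long freed = 0;
	enum flush_state state = FLUSH_DELAYED_ITEMS_NR;

	for (;;) {
		freed += reclaim(state);
		if (freed >= needed || state == COMMIT_TRANS)
			return state;
		state++;
	}
}

/* Made-up reclaim amounts: only running delayed refs frees anything. */
static long demo_reclaim(enum flush_state state)
{
	return state == FLUSH_DELAYED_REFS ? 64 : 0;
}
```

The point of the patch, in these terms: delayed refs become one more reclaim step the loop can try on demand, instead of ad-hoc "should we commit?" checks scattered around.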
Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots
On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote: > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik wrote: > > > > From: Josef Bacik > > > > When debugging some weird extent reference bug I suspected that we were > > changing a snapshot while we were deleting it, which could explain my > > bug. This was indeed what was happening, and this patch helped me > > verify my theory. It is never correct to modify the snapshot once it's > > being deleted, so mark the root when we are deleting it and make sure we > > complain about it when it happens. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/ctree.c | 3 +++ > > fs/btrfs/ctree.h | 1 + > > fs/btrfs/extent-tree.c | 9 + > > 3 files changed, 13 insertions(+) > > > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c > > index 5912a97b07a6..5f82f86085e8 100644 > > --- a/fs/btrfs/ctree.c > > +++ b/fs/btrfs/ctree.c > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct > > btrfs_trans_handle *trans, > > u64 search_start; > > int ret; > > > > + if (test_bit(BTRFS_ROOT_DELETING, &root->state)) > > + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being > > dropped\n"); > > Please use btrfs_warn(), it makes sure we use a consistent message > style, identifies the fs, etc. > Also, "thats" should be "that is" or "that's". > Ah yeah, I was following the other convention in there but we should probably convert all of those to btrfs_warn. I'll fix the grammar thing as well, just a leftover from the much less code of conduct friendly message I originally had there. Thanks, Josef
[PATCH 0/2] Fix aborts when dropping snapshots
With more machines running my recent delayed ref rsv patches we started seeing a spike in boxes aborting in __btrfs_free_extent because they were unable to find the extent ref. The full details are in 2/2, but the summary is that there's been a bug ever since the original delayed inode item stuff was introduced: the delayed items could run after the snapshot deletion had already started, which will result in all sorts of extent reference shenanigans. This was tricky to hit before, but with my iput changes it's become much easier to hit on our build boxes that are heavy users of snapshot creation/deletion. Thanks, Josef
[PATCH 1/2] btrfs: catch cow on deleting snapshots
From: Josef Bacik When debugging some weird extent reference bug I suspected that we were changing a snapshot while we were deleting it, which could explain my bug. This was indeed what was happening, and this patch helped me verify my theory. It is never correct to modify the snapshot once it's being deleted, so mark the root when we are deleting it and make sure we complain about it when it happens. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.c | 3 +++ fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 9 + 3 files changed, 13 insertions(+) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 5912a97b07a6..5f82f86085e8 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, u64 search_start; int ret; + if (test_bit(BTRFS_ROOT_DELETING, &root->state)) + WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n"); + if (trans->transaction != fs_info->running_transaction) WARN(1, KERN_CRIT "trans %llu running %llu\n", trans->transid, diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index facde70c15ed..5a3a94ccb65c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1199,6 +1199,7 @@ enum { BTRFS_ROOT_FORCE_COW, BTRFS_ROOT_MULTI_LOG_TASKS, BTRFS_ROOT_DIRTY, + BTRFS_ROOT_DELETING, }; /* diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 581c2a0b2945..dcb699dd57f3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root, if (block_rsv) trans->block_rsv = block_rsv; + /* +* This will help us catch people modifying the fs tree while we're +* dropping it. It is unsafe to mess with the fs tree while it's being +* dropped as we unlock the root node and parent nodes as we walk down +* the tree, assuming nothing will change. If something does change +* then we'll have stale information and drop references to blocks we've +* already dropped.
+*/ + set_bit(BTRFS_ROOT_DELETING, &root->state); if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { level = btrfs_header_level(root->node); path->nodes[level] = btrfs_lock_root_node(root); -- 2.14.3
[PATCH 2/2] btrfs: run delayed items before dropping the snapshot
From: Josef Bacik With my delayed refs patches in place we started seeing a large amount of aborts in __btrfs_free_extent BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0 Call Trace: ? btrfs_merge_delayed_refs+0xaf/0x340 __btrfs_run_delayed_refs+0x6ea/0xfc0 ? btrfs_set_path_blocking+0x31/0x60 btrfs_run_delayed_refs+0xeb/0x180 btrfs_commit_transaction+0x179/0x7f0 ? btrfs_check_space_for_delayed_refs+0x30/0x50 ? should_end_transaction.isra.19+0xe/0x40 btrfs_drop_snapshot+0x41c/0x7c0 btrfs_clean_one_deleted_snapshot+0xb5/0xd0 cleaner_kthread+0xf6/0x120 kthread+0xf8/0x130 ? btree_invalidatepage+0x90/0x90 ? kthread_bind+0x10/0x10 ret_from_fork+0x35/0x40 This was because btrfs_drop_snapshot depends on the root not being modified while it's dropping the snapshot. It will unlock the root node (and really every node) as it walks down the tree, only to re-lock it when it needs to do something. This is a problem because if we modify the tree we could cow a block in our path, which frees our reference to that block. Then once we get back to that shared block we'll free our reference to it again, and get ENOENT when trying to look up our extent reference to that block in __btrfs_free_extent. This is ultimately happening because we have delayed items left to be processed for our deleted snapshot _after_ all of the inodes are closed for the snapshot. We only run the delayed inode item if we're deleting the inode, and even then we do not run the delayed insertions or delayed removals. These can be run at any point after our final inode does its last iput, which is what triggers the snapshot deletion. We can end up with the snapshot deletion happening and then have the delayed items run on that file system, resulting in the above problem.
This problem has existed forever; however, my patches made it much easier to hit as I wake up the cleaner much more often to deal with delayed iputs, which made us more likely to start the snapshot dropping work before the transaction commits, which is when the delayed items would generally be run. Before, generally speaking, we would run the delayed items, commit the transaction, and wake up the cleaner thread to start deleting snapshots, which means we were less likely to hit this problem. You could still hit it if you had multiple snapshots to be deleted and ended up with lots of delayed items, but it was definitely harder. Fix for now by simply running all the delayed items before starting to drop the snapshot. We could make this smarter in the future by making the delayed items per-root, and then simply drop any delayed items for roots that we are going to delete. But for now just a quick and easy solution is the safest. Cc: sta...@vger.kernel.org Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index dcb699dd57f3..965702034b22 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root, goto out_free; } + btrfs_run_delayed_items(trans); + if (block_rsv) trans->block_rsv = block_rsv; -- 2.14.3
Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
On Tue, Nov 27, 2018 at 07:59:42PM +, Chris Mason wrote: > On 27 Nov 2018, at 14:54, Josef Bacik wrote: > > > On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote: > >> > >> > >> On 21.11.18 г. 21:09 ч., Josef Bacik wrote: > >>> The cleaner thread usually takes care of delayed iputs, with the > >>> exception of the btrfs_end_transaction_throttle path. The cleaner > >>> thread only gets woken up every 30 seconds, so instead wake it up to > >>> do > >>> its work so that we can free up that space as quickly as possible. > >> > >> Have you done any measurements of how this affects the overall system. I > >> suspect this introduces a lot of noise since now we are going to be > >> doing a thread wakeup on every iput, does this give a chance to have > >> nice, large batches of iputs that the cost of wake up can be > >> amortized > >> across? > > > > I ran the whole patchset with our A/B testing stuff and the patchset > > was a 5% > > win overall, so I'm inclined to think it's fine. Thanks, > > It's a good point though, a delayed wakeup may be less overhead. Sure, but how do we go about that without it sometimes messing up? In practice the only time we're doing this is at the end of finish_ordered_io, so likely to not be a constant stream. I suppose since we have places where we force it to run that we don't really need this. IDK, I'm fine with dropping it. Thanks, Josef
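The trade-off being discussed here is a batching question: if only the iput that makes the queue non-empty issues a wakeup, a burst of iputs costs one wakeup and the rest are amortized. A toy model of one such scheme (not the kernel code; `wakeups` just counts calls that would hit wake_up_process()):

```c
#include <assert.h>

static int pending_iputs;
static int wakeups;

/*
 * Sketch of an amortized wakeup scheme: only the iput that makes the
 * queue non-empty pays for a wakeup, so a burst of N iputs costs one
 * wakeup instead of N.
 */
static void add_delayed_iput(void)
{
	if (pending_iputs++ == 0)
		wakeups++;	/* stand-in for wake_up_process(cleaner) */
}

/* The cleaner drains the whole batch in one pass. */
static void cleaner_drains_queue(void)
{
	pending_iputs = 0;
}
```

This is only one way to delay the wakeup; a timer- or threshold-based batch would trade latency for even fewer wakeups, which is exactly the noise-versus-responsiveness question raised in the review.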
Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
On Tue, Nov 27, 2018 at 10:29:57AM +0200, Nikolay Borisov wrote: > > > On 21.11.18 г. 21:09 ч., Josef Bacik wrote: > > The throttle path doesn't take cleaner_delayed_iput_mutex, which means > > Which one is the throttle path? btrfs_end_transaction_throttle is only > called during snapshot drop and relocation? What scenario triggered your > observation and prompted this fix? > One of my enospc tests runs snapshot creation/deletion in the background. > > we could think we're done flushing iputs in the data space reservation > > path when we could have a throttler doing an iput. There's no real > > reason to serialize the delayed iput flushing, so instead of taking the > > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just > > replace it with an atomic counter and a waitqueue. This removes the > > short (or long depending on how big the inode is) window where we think > > there are no more pending iputs when there really are some. > > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/ctree.h | 4 +++- > > fs/btrfs/disk-io.c | 5 ++--- > > fs/btrfs/extent-tree.c | 9 + > > fs/btrfs/inode.c | 21 + > > 4 files changed, 31 insertions(+), 8 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 709de7471d86..a835fe7076eb 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -912,7 +912,8 @@ struct btrfs_fs_info { > > > > spinlock_t delayed_iput_lock; > > struct list_head delayed_iputs; > > - struct mutex cleaner_delayed_iput_mutex; > > + atomic_t nr_delayed_iputs; > > + wait_queue_head_t delayed_iputs_wait; > > > > /* this protects tree_mod_seq_list */ > > spinlock_t tree_mod_seq_lock; > > @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); > > int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); > > void btrfs_add_delayed_iput(struct inode *inode); > > void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); > > +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info 
*fs_info); > > int btrfs_prealloc_file_range(struct inode *inode, int mode, > > u64 start, u64 num_bytes, u64 min_size, > > loff_t actual_len, u64 *alloc_hint); > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > > index c5918ff8241b..3f81dfaefa32 100644 > > --- a/fs/btrfs/disk-io.c > > +++ b/fs/btrfs/disk-io.c > > @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg) > > goto sleep; > > } > > > > - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); > > btrfs_run_delayed_iputs(fs_info); > > - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); > > > > again = btrfs_clean_one_deleted_snapshot(root); > > mutex_unlock(&fs_info->cleaner_mutex); > > @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb, > > mutex_init(&fs_info->delete_unused_bgs_mutex); > > mutex_init(&fs_info->reloc_mutex); > > mutex_init(&fs_info->delalloc_root_mutex); > > - mutex_init(&fs_info->cleaner_delayed_iput_mutex); > > seqlock_init(&fs_info->profiles_lock); > > > > INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); > > @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb, > > atomic_set(&fs_info->defrag_running, 0); > > atomic_set(&fs_info->qgroup_op_seq, 0); > > atomic_set(&fs_info->reada_works_cnt, 0); > > + atomic_set(&fs_info->nr_delayed_iputs, 0); > > atomic64_set(&fs_info->tree_mod_seq, 0); > > fs_info->sb = sb; > > fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; > > @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb, > > init_waitqueue_head(&fs_info->transaction_wait); > > init_waitqueue_head(&fs_info->transaction_blocked_wait); > > init_waitqueue_head(&fs_info->async_submit_wait); > > + init_waitqueue_head(&fs_info->delayed_iputs_wait); > > > > INIT_LIST_HEAD(&fs_info->pinned_chunks); > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 3a90dc1d6b31..36f43876be22 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct > > btrfs_inode *inode, u64 bytes) > > * operations.
Wait for it to finish so that > > * more space is released. > >
Re: [PATCH] btrfs: only run delayed refs if we're committing
On Tue, Nov 27, 2018 at 07:43:39PM +, Filipe Manana wrote: > On Tue, Nov 27, 2018 at 7:22 PM Josef Bacik wrote: > > > > On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote: > > > On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik wrote: > > > > > > > > I noticed in a giant dbench run that we spent a lot of time on lock > > > > contention while running transaction commit. This is because dbench > > > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and > > > > they all run the delayed refs first thing, so they all contend with > > > > each other. This leads to seconds of 0 throughput. Change this to only > > > > run the delayed refs if we're the ones committing the transaction. This > > > > makes the latency go away and we get no more lock contention. > > > > > > Can you share the following in the changelog please? > > > > > > 1) How did you run dbench (parameters, config). > > > > > > 2) What results did you get before and after this change. So that we all > > > get > > > an idea of how good the impact is. > > > > > > While the reduced contention makes all sense and seems valid, I'm not > > > sure this is always a win. > > > It certainly is when multiple tasks are calling > > > btrfs_commit_transaction() simultaneously, but, > > > what about when only one does it? > > > > > > By running all delayed references inside the critical section of the > > > transaction commit > > > (state == TRANS_STATE_COMMIT_START), instead of running most of them > > > outside/before, > > > we will be blocking for a longer time other tasks calling > > > btrfs_start_transaction() (which is used > > > a lot - creating files, unlinking files, adding links, etc, and even > > > fsync). > > > > > > Won't there be any other types of workload and tests other than dbench > > > that can get increased > > > latency and/or smaller throughput? > > > > > > I find that sort of information useful to have in the changelog.
If > > > you verified that or you think > > > it's irrelevant to measure/consider, it would be great to have it > > > mentioned in the changelog > > > (and explained). > > > > Yeah I thought about the delayed refs being run in the critical section now, > > that's not awesome. I'll drop this for now, I think just having a mutex > > around > > running delayed refs will be good enough, since we want people who care > > about > > flushing delayed refs to wait around for that to finish happening. Thanks, > > Well, I think we can have a solution that doesn't bring such trade-off > nor introduce a mutex. > We could do like what is currently done for writing space caches, to > make sure only the first task > calling commit transaction does the work and all others do nothing > except waiting for the commit to finish: > > btrfs_commit_transaction() >if (!test_and_set_bit(BTRFS_TRANS_COMMIT_START, &cur_trans->flags)) { >run delayed refs before entering critical section >} > That was my first inclination but then one of the other committers goes past this and gets into the start TRANS_STATE_COMMIT_START place and has to wait for the guy running the delayed refs to finish before moving forward, essentially extending the time in the critical section for the same reason. The mutex means that if 9000 people try to commit the transaction, 1 guy gets to run the delayed refs and everybody else waits for him to finish, and in the meantime the critical section is small. Thanks, Josef
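The test_and_set_bit suggestion amounts to a run-once gate: whoever flips the bit first runs the delayed refs before the critical section, and every other committer skips the work and just waits on the commit. A minimal sketch using C11 atomics (illustrative only, single transaction lifetime):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_flag commit_started = ATOMIC_FLAG_INIT;
static int delayed_ref_runs;

/*
 * Run-once gate: the first task to flip the flag runs the delayed refs
 * before the critical section; every other committer skips the work
 * and proceeds straight to waiting for the commit.
 */
static void commit_transaction(void)
{
	if (!atomic_flag_test_and_set(&commit_started))
		delayed_ref_runs++;	/* stand-in for btrfs_run_delayed_refs() */
	/* ...then everyone waits for/joins the commit... */
}
```

The objection in the reply maps onto this sketch too: a second committer can still reach TRANS_STATE_COMMIT_START while the first task is mid delayed-ref run and must wait there, whereas serializing the delayed-ref run behind a mutex keeps that work entirely outside the commit critical section.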
Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote: > > > On 21.11.18 г. 21:09 ч., Josef Bacik wrote: > > The cleaner thread usually takes care of delayed iputs, with the > > exception of the btrfs_end_transaction_throttle path. The cleaner > > thread only gets woken up every 30 seconds, so instead wake it up to do > > its work so that we can free up that space as quickly as possible. > > Have you done any measurements of how this affects the overall system. I > suspect this introduces a lot of noise since now we are going to be > doing a thread wakeup on every iput, does this give a chance to have > nice, large batches of iputs that the cost of wake up can be amortized > across? I ran the whole patchset with our A/B testing stuff and the patchset was a 5% win overall, so I'm inclined to think it's fine. Thanks, Josef
Re: [PATCH 5/8] btrfs: don't enospc all tickets on flush failure
On Mon, Nov 26, 2018 at 02:25:52PM +0200, Nikolay Borisov wrote: > > > On 21.11.18 г. 21:03 ч., Josef Bacik wrote: > > With the introduction of the per-inode block_rsv it became possible to > > have really really large reservation requests made because of data > > fragmentation. Since the ticket stuff assumed that we'd always have > > relatively small reservation requests it just killed all tickets if we > > were unable to satisfy the current request. However this is generally > > not the case anymore. So fix this logic to instead see if we had a > > ticket that we were able to give some reservation to, and if we were > > continue the flushing loop again. Likewise we make the tickets use the > > space_info_add_old_bytes() method of returning what reservation they did > > receive in hopes that it could satisfy reservations down the line. > > > The logic of the patch can be summarised as follows: > > If no progress is made for a ticket, then start failing all tickets until > the first one that has progress made on its reservation (inclusive). In > this case this first ticket will be failed but at least its space will > be reused via space_info_add_old_bytes. > > Frankly this seems really arbitrary. It's not though. The tickets are in order of who requested the reservation. Because we will backfill reservations for things like hugely fragmented files or large amounts of delayed refs we can have spikes where we're trying to reserve 100mb's of metadata space. We may fill 50mb of that before we run out of space. Well so we can't satisfy that reservation, but the small 100k reservations that are waiting to be serviced can be satisfied and they can run. The alternative is you get ENOSPC and then you can turn around and touch a file no problem because it's a small reservation and there was room for it. This patch enables better behavior for the user. Thanks, Josef
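The summarized logic — fail tickets from the head of the queue up to and including the first one that received a partial grant, recycling that grant for the tickets behind it — can be modeled directly (hypothetical structs, standing in for the real ticket/space_info code):

```c
#include <assert.h>

/* Hypothetical stand-in for a reservation ticket. */
struct ticket {
	long bytes_granted;	/* partial reservation already received */
	int failed;
};

/*
 * Model of the patch's behavior: fail the tickets at the head of the
 * list up to and including the first one that made progress, and return
 * its partial grant to the pool (space_info_add_old_bytes in the real
 * code) so the smaller tickets behind it can still be satisfied.
 */
static long fail_head_tickets(struct ticket *tickets, int n)
{
	long returned = 0;
	int i;

	for (i = 0; i < n; i++) {
		tickets[i].failed = 1;
		returned += tickets[i].bytes_granted;
		if (tickets[i].bytes_granted > 0)
			break;	/* later tickets keep waiting */
	}
	return returned;
}
```

This is the FIFO-ordering argument from the reply: the huge backfill ticket at the head eats the ENOSPC, while the small reservations queued behind it get a chance to run from the recycled bytes instead of all failing together.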
Re: [PATCH] btrfs: only run delayed refs if we're committing
On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote: > On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik wrote: > > > > I noticed in a giant dbench run that we spent a lot of time on lock > > contention while running transaction commit. This is because dbench > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and > > they all run the delayed refs first thing, so they all contend with > > each other. This leads to seconds of 0 throughput. Change this to only > > run the delayed refs if we're the ones committing the transaction. This > > makes the latency go away and we get no more lock contention. > > Can you share the following in the changelog please? > > 1) How did you run dbench (parameters, config). > > 2) What results did you get before and after this change. So that we all get > an idea of how good the impact is. > > While the reduced contention makes all sense and seems valid, I'm not > sure this is always a win. > It certainly is when multiple tasks are calling > btrfs_commit_transaction() simultaneously, but, > what about when only one does it? > > By running all delayed references inside the critical section of the > transaction commit > (state == TRANS_STATE_COMMIT_START), instead of running most of them > outside/before, > we will be blocking for a longer time other tasks calling > btrfs_start_transaction() (which is used > a lot - creating files, unlinking files, adding links, etc, and even fsync). > > Won't there be any other types of workload and tests other than dbench > that can get increased > latency and/or smaller throughput? > > I find that sort of information useful to have in the changelog. If > you verified that or you think > it's irrelevant to measure/consider, it would be great to have it > mentioned in the changelog > (and explained). > Yeah I thought about the delayed refs being run in the critical section now, that's not awesome.
I'll drop this for now, I think just having a mutex around running delayed refs will be good enough, since we want people who care about flushing delayed refs to wait around for that to finish happening. Thanks, Josef
Re: [PATCH 5/6] btrfs: introduce delayed_refs_rsv
On Mon, Nov 26, 2018 at 11:14:12AM +0200, Nikolay Borisov wrote: > > > On 21.11.18 г. 20:59 ч., Josef Bacik wrote: > > From: Josef Bacik > > > > Traditionally we've had voodoo in btrfs to account for the space that > > delayed refs may take up by having a global_block_rsv. This works most > > of the time, except when it doesn't. We've had issues reported and seen > > in production where sometimes the global reserve is exhausted during > > transaction commit before we can run all of our delayed refs, resulting > > in an aborted transaction. Because of this voodoo we have equally > > dubious flushing semantics around throttling delayed refs which we often > > get wrong. > > > > So instead give them their own block_rsv. This way we can always know > > exactly how much outstanding space we need for delayed refs. This > > allows us to make sure we are constantly filling that reservation up > > with space, and allows us to put more precise pressure on the enospc > > system. Instead of doing math to see if it's a good time to throttle, > > the normal enospc code will be invoked if we have a lot of delayed refs > > pending, and they will be run via the normal flushing mechanism. > > > > For now the delayed_refs_rsv will hold the reservations for the delayed > > refs, the block group updates, and deleting csums. We could have a > > separate rsv for the block group updates, but the csum deletion stuff is > > still handled via the delayed_refs so that will stay there. > > Couldn't the same "no premature ENOSPC in critical section" effect be > achieved if we simply allocate 2* num bytes in start transaction without > adding the additional granularity for delayed refs? There seems to be a > lot of supporting code added and this obfuscates the simple idea that > now we do 2* reservation and put it in a separate block_rsv structure.
> > > > Signed-off-by: Josef Bacik > > --- > > fs/btrfs/ctree.h | 26 ++-- > > fs/btrfs/delayed-ref.c | 28 - > > fs/btrfs/disk-io.c | 4 + > > fs/btrfs/extent-tree.c | 279 +++ > > fs/btrfs/inode.c | 4 +- > > fs/btrfs/transaction.c | 77 ++-- > > include/trace/events/btrfs.h | 2 + > > 7 files changed, 313 insertions(+), 107 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 8b41ec42f405..0c6d589c8ce4 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -448,8 +448,9 @@ struct btrfs_space_info { > > #define BTRFS_BLOCK_RSV_TRANS 3 > > #define BTRFS_BLOCK_RSV_CHUNK 4 > > #define BTRFS_BLOCK_RSV_DELOPS 5 > > -#define BTRFS_BLOCK_RSV_EMPTY 6 > > -#define BTRFS_BLOCK_RSV_TEMP 7 > > +#define BTRFS_BLOCK_RSV_DELREFS 6 > > +#define BTRFS_BLOCK_RSV_EMPTY 7 > > +#define BTRFS_BLOCK_RSV_TEMP 8 > > > > struct btrfs_block_rsv { > > u64 size; > > @@ -812,6 +813,8 @@ struct btrfs_fs_info { > > struct btrfs_block_rsv chunk_block_rsv; > > /* block reservation for delayed operations */ > > struct btrfs_block_rsv delayed_block_rsv; > > + /* block reservation for delayed refs */ > > + struct btrfs_block_rsv delayed_refs_rsv; > > > > struct btrfs_block_rsv empty_block_rsv; > > > > @@ -2628,7 +2631,7 @@ static inline u64 > > btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info, > > } > > > > int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans); > > -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans); > > +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); > > void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, > > const u64 start); > > void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache > > *bg); > > @@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum { > > enum btrfs_flush_state { > > FLUSH_DELAYED_ITEMS_NR = 1, > > FLUSH_DELAYED_ITEMS = 2, > > - FLUSH_DELALLOC = 3, > > - FLUSH_DELALLOC_WAIT = 4, > > - ALLOC_CHUNK = 5, > > - COMMIT_TRANS =
6, > > + FLUSH_DELAYED_REFS_NR = 3, > > + FLUSH_DELAYED_REFS = 4, > > + FLUSH_DELALLOC = 5, > > + FLUSH_DELALLOC_WAIT =
Re: [PATCH 3/6] btrfs: cleanup extent_op handling
On Thu, Nov 22, 2018 at 12:09:34PM +0200, Nikolay Borisov wrote: > > > On 21.11.18 г. 20:59 ч., Josef Bacik wrote: > > From: Josef Bacik > > > > The cleanup_extent_op function actually would run the extent_op if it > > needed running, which made the name sort of a misnomer. Change it to > > run_and_cleanup_extent_op, and move the actual cleanup work to > > cleanup_extent_op so it can be used by check_ref_cleanup() in order to > > unify the extent op handling. > > The whole name extent_op is actually a misnomer since AFAIR this is some > sort of modification of the references of metadata nodes. I don't see > why it can't be made as yet another type of reference which is run for a > given node. > It would change the key for a metadata extent reference for non-skinny metadata, and it sets the FULL_BACKREF flag. Since it really only changes flags now we could probably roll that into its own thing, but that's out of scope for this stuff. Thanks, Josef
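If the surviving use of the extent op really is just a deferred flags update on the extent item (e.g. setting FULL_BACKREF once COW breaks the shared path), the "roll it into its own thing" idea reduces to something like the following toy model (the struct and helper are hypothetical; the flag value is illustrative):

```c
#include <assert.h>

/* Illustrative flag value; the kernel's BTRFS_BLOCK_FLAG_FULL_BACKREF
 * plays this role for metadata extent items. */
#define BLOCK_FLAG_FULL_BACKREF (1ULL << 8)

/* Hypothetical model of an extent op that only updates flags. */
struct extent_op_model {
	unsigned long long flags_to_set;
	int update_flags;
};

/* Apply the deferred flags update to an extent item's flags word. */
static unsigned long long apply_extent_op(unsigned long long item_flags,
					  const struct extent_op_model *op)
{
	if (op->update_flags)
		item_flags |= op->flags_to_set;
	return item_flags;
}
```

A self-contained operation like this could plausibly be queued as just another delayed-ref-style record against the node, which is the reviewer's point; the reply notes the non-skinny key update is what historically made it more than a flags change.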
Re: [PATCH] btrfs: Remove unnecessary code from __btrfs_rebalance
On Thu, Nov 22, 2018 at 10:17:33AM +0200, Nikolay Borisov wrote: > The first step of the rebalance process is ensuring there is 1MB free on > each device. This number seems rather small. And in fact when talking > to the original authors their opinions were: > > "man that's a little bonkers" > "i don't think we even need that code anymore" > "I think it was there to make sure we had room for the blank 1M at the > beginning. I bet it goes all the way back to v0" > "we just don't need any of that tho, i say we just delete it" > > Clearly, this piece of code has lost its original intent throughout > the years. It doesn't really bring any real practical benefits to the > relocation process. No functional changes. > > Signed-off-by: Nikolay Borisov > Suggested-by: Josef Bacik Reviewed-by: Josef Bacik Thanks, Josef
[PATCH 1/3] btrfs: run delayed iputs before committing
Delayed iputs mean we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be freed. So before we decide to commit the transaction for ENOSPC reasons, run the delayed iputs so that any potential space is freed up. If there were any and we freed enough we can then commit the transaction and potentially be able to make our reservation. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 90423b6749b7..3a90dc1d6b31 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4833,6 +4833,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; + /* + * If we have pending delayed iputs then we could free up a bunch of + * pinned space, so make sure we run the iputs before we do our pinned + * bytes check below. + */ + mutex_lock(&fs_info->cleaner_delayed_iput_mutex); + btrfs_run_delayed_iputs(fs_info); + mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) return PTR_ERR(trans); -- 2.14.3
[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. The cleaner thread only gets woken up every 30 seconds, so instead wake it up to do its work so that we can free up that space as quickly as possible. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3da9ac463344..3c42d8887183 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3264,6 +3264,7 @@ void btrfs_add_delayed_iput(struct inode *inode) ASSERT(list_empty(&binode->delayed_iput)); list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs); spin_unlock(&fs_info->delayed_iput_lock); + wake_up_process(fs_info->cleaner_kthread); } void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info) -- 2.14.3
[PATCH] btrfs: only run delayed refs if we're committing
I noticed in a giant dbench run that we spent a lot of time on lock contention while running transaction commit. This is because dbench results in a lot of fsync()'s that do a btrfs_commit_transaction(), and they all run the delayed refs first thing, so they all contend with each other. This leads to seconds of 0 throughput. Change this to only run the delayed refs if we're the ones committing the transaction. This makes the latency go away and we get no more lock contention. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 24 +++++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 3c1be9db897c..41cc96cc59a3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - /* make a pass through all the delayed refs we have so far - * any runnings procs may add more while we are here - */ - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - cur_trans = trans->transaction; /* @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_create_pending_block_groups(trans); - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) { int run_it = 0; @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_unlock(&fs_info->trans_lock); } + /* + * We are now the only one in the commit area, we can run delayed refs + * without hitting a bunch of lock contention from a lot of people + * trying to commit the transaction at once. + */ + ret = btrfs_run_delayed_refs(trans, 0); + if (ret) + goto cleanup_transaction; + extwriter_counter_dec(cur_trans, trans->type); ret = btrfs_start_delalloc_flush(fs_info); -- 2.14.3
[PATCH 0/3] Delayed iput fixes
Here are some delayed iput fixes. Delayed iputs can hold reservations for a while and there's no really good way to make sure they are gone for good, which means we could hit an early ENOSPC when in reality, if we had just waited for the iput, we would have had plenty of space. So fix this up by making us wait for delayed iputs when deciding if we need to commit for enospc flushing, and then clean up and rework how we run delayed iputs to make it more straightforward to wait on them and make sure we're all done using them. Thanks, Josef
[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 4 +++- fs/btrfs/disk-io.c | 5 ++--- fs/btrfs/extent-tree.c | 9 + fs/btrfs/inode.c | 21 + 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 709de7471d86..a835fe7076eb 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -912,7 +912,8 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct mutex cleaner_delayed_iput_mutex; + atomic_t nr_delayed_iputs; + wait_queue_head_t delayed_iputs_wait; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info); int btrfs_prealloc_file_range(struct inode *inode, int mode, u64 start, u64 num_bytes, u64 min_size, loff_t actual_len, u64 *alloc_hint); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c5918ff8241b..3f81dfaefa32 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg) goto sleep; } - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); again =
btrfs_clean_one_deleted_snapshot(root); mutex_unlock(&fs_info->cleaner_mutex); @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); - mutex_init(&fs_info->cleaner_delayed_iput_mutex); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb, atomic_set(&fs_info->defrag_running, 0); atomic_set(&fs_info->qgroup_op_seq, 0); atomic_set(&fs_info->reada_works_cnt, 0); + atomic_set(&fs_info->nr_delayed_iputs, 0); atomic64_set(&fs_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb, init_waitqueue_head(&fs_info->transaction_wait); init_waitqueue_head(&fs_info->transaction_blocked_wait); init_waitqueue_head(&fs_info->async_submit_wait); + init_waitqueue_head(&fs_info->delayed_iputs_wait); INIT_LIST_HEAD(&fs_info->pinned_chunks); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3a90dc1d6b31..36f43876be22 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) * operations. Wait for it to finish so that * more space is released. */ - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + ret = btrfs_wait_on_delayed_iputs(fs_info); + if (ret) + return ret; goto again; } else { btrfs_end_transaction(trans); @@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * pinned space, so make sure we run the iputs before we do our pinned * bytes check below.
*/ - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + wait_event(fs_info->delayed_iputs_wait, + atomic_read(&fs_info->nr_delayed_iputs) == 0); trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) diff --git a/fs/btrfs/inode.c b/fs/b
[PATCH 4/7] btrfs: call btrfs_create_pending_block_groups unconditionally
The first thing we do is loop through the list, so this if (!list_empty()) btrfs_create_pending_block_groups(); pattern is just wasted space. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 3 +-- fs/btrfs/transaction.c | 6 ++---- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a0f8880ee5dd..b9b829c8825c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3004,8 +3004,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } if (run_all) { - if (!list_empty(&trans->new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); spin_lock(&delayed_refs->lock); node = rb_first_cached(&delayed_refs->href_root); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a4682a696fb6..826a15a07fce 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -845,8 +845,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - if (!list_empty(&trans->new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); @@ -1937,8 +1936,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) cur_trans->delayed_refs.flushing = 1; smp_wmb(); - if (!list_empty(&trans->new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); ret = btrfs_run_delayed_refs(trans, 0); if (ret) { -- 2.14.3
[PATCH 6/7] btrfs: cleanup pending bgs on transaction abort
We may abort the transaction during a commit and not have a chance to run the pending bgs stuff, which will leave block groups on our list and cause us accounting issues and leaked memory. Fix this by running the pending bgs when we cleanup a transaction. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 826a15a07fce..3c1be9db897c 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2269,6 +2269,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_scrub_continue(fs_info); cleanup_transaction: btrfs_trans_release_metadata(trans); + /* This cleans up the pending block groups list properly. */ + if (!trans->aborted) + trans->aborted = ret; + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); trans->block_rsv = NULL; btrfs_warn(fs_info, "Skipping commit of aborted transaction."); -- 2.14.3
[PATCH 5/7] btrfs: just delete pending bgs if we are aborted
We still need to do all of the accounting cleanup for pending block groups if we abort. So set ret to trans->aborted so that if we aborted the cleanup happens and everybody is happy. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index b9b829c8825c..90423b6749b7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10500,11 +10500,17 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; - int ret = 0; + int ret; if (!trans->can_flush_pending_bgs) return; + /* + * If we aborted the transaction with pending bg's we need to just + * cleanup the list and carry on. + */ + ret = trans->aborted; + while (!list_empty(&trans->new_bgs)) { block_group = list_first_entry(&trans->new_bgs, struct btrfs_block_group_cache, -- 2.14.3
[PATCH 7/7] btrfs: wait on ordered extents on abort cleanup
If we flip read-only before we initiate writeback on all dirty pages for ordered extents we've created then we'll have ordered extents left over on umount, which results in all sorts of bad things happening. Fix this by making sure we wait on ordered extents if we have to do the aborted transaction cleanup stuff. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8e7926c91e35..c5918ff8241b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4171,6 +4171,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info) spin_lock(&fs_info->ordered_root_lock); } spin_unlock(&fs_info->ordered_root_lock); + + /* + * We need this here because if we've been flipped read-only we won't + * get sync() from the umount, so we need to make sure any ordered + * extents that haven't had their dirty pages IO start writeout yet + * actually get run and error out properly. + */ + btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); } static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, -- 2.14.3
[PATCH 0/7] Abort cleanup fixes
A new xfstest that really hammers on transaction aborts (generic/495, I think) uncovered a lot of random issues. Some of these were introduced with the new delayed refs rsv patches, others were just exposed by them, such as the pending bg stuff. With these patches in place I stopped getting all the random leftovers and WARN_ON()'s when running that test and things are much smoother now. Thanks, Josef
[PATCH 3/7] btrfs: handle delayed ref head accounting cleanup in abort
We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 5 + fs/btrfs/disk-io.c | 1 + fs/btrfs/extent-tree.c | 13 ++--- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8ccc5019172b..709de7471d86 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -35,6 +35,7 @@ struct btrfs_trans_handle; struct btrfs_transaction; struct btrfs_pending_snapshot; +struct btrfs_delayed_ref_root; extern struct kmem_cache *btrfs_trans_handle_cachep; extern struct kmem_cache *btrfs_bit_radix_cachep; extern struct kmem_cache *btrfs_path_cachep; @@ -2643,6 +2644,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long count); int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, unsigned long count, u64 transid, int wait); +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len); int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7d02748cf3f6..8e7926c91e35 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4223,6 +4223,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (pin_bytes) btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); btrfs_put_delayed_ref_head(head); cond_resched(); spin_lock(_refs->lock); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e31980d451c2..a0f8880ee5dd 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ 
-2457,12 +2457,11 @@ static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, - struct btrfs_delayed_ref_head *head) +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) { - struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_delayed_ref_root *delayed_refs = - &trans->transaction->delayed_refs; int nr_items = 1; if (head->total_ref_mod < 0) { @@ -2540,7 +2539,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); @@ -7218,7 +7217,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head); mutex_unlock(&head->mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 2/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
Instead of open coding this stuff use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f062fb0487cd..7d02748cf3f6 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4215,12 +4215,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (head->must_insert_reserved) pin_bytes = true; btrfs_free_delayed_extent_op(head->extent_op); - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; - atomic_dec(&delayed_refs->num_entries); - rb_erase_cached(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); mutex_unlock(&head->mutex); -- 2.14.3
[PATCH 1/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
We have this open coded in btrfs_destroy_delayed_refs, use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index beaa58e742b5..f062fb0487cd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4197,16 +4197,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); - if (!mutex_trylock(&head->mutex)) { - refcount_inc(&head->refs); - spin_unlock(&delayed_refs->lock); - - mutex_lock(&head->mutex); - mutex_unlock(&head->mutex); - btrfs_put_delayed_ref_head(head); - spin_lock(&delayed_refs->lock); + if (btrfs_delayed_ref_lock(delayed_refs, head)) continue; - } + spin_lock(&head->lock); while ((n = rb_first_cached(&head->ref_tree)) != NULL) { ref = rb_entry(n, struct btrfs_delayed_ref_node, -- 2.14.3
[PATCH 3/8] btrfs: don't use global rsv for chunk allocation
We've done this forever because of the voodoo around knowing how much space we have. However we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 9 --------- 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7a30fbc05e5e..a91b3183dcae 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4388,21 +4388,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int should_alloc_chunk(struct btrfs_fs_info *fs_info, struct btrfs_space_info *sinfo, int force) { - struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; u64 bytes_used = btrfs_space_info_used(sinfo, false); u64 thresh; if (force == CHUNK_ALLOC_FORCE) return 1; - /* - * We need to take into account the global rsv because for all intents - * and purposes it's used space. Don't worry about locking the - * global_rsv, it doesn't change except when the transaction commits. - */ - if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA) - bytes_used += calc_global_rsv_need_space(global_rsv); - /* * in limited mode, we want to have some free space up to * about 1% of the FS size. -- 2.14.3
[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata chunk allocation we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add an ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 18 +++++++++++++++++- include/trace/events/btrfs.h | 1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0c6d589c8ce4..8ccc5019172b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2750,7 +2750,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS = 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS = 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a91b3183dcae..e6bb6ce23c84 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), - CHUNK_ALLOC_NO_FORCE); + (state == ALLOC_CHUNK) ?
+ CHUNK_ALLOC_NO_FORCE : + CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) commit_cycles--; } + /* + * We don't want to force a chunk allocation until we've tried + * pretty hard to reclaim space. Think of the case where we + * free'd up a bunch of space and so have a lot of pinned space + * to reclaim. We would rather use that than possibly create an + * underutilized metadata chunk. So if this is our first run + * through the flushing state machine skip ALLOC_CHUNK_FORCE and + * commit the transaction. If nothing has changed the next go + * around then we can force a chunk allocation. + */ + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; + if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 63d1f9d8b8c7..dd0e6f8d6b6e 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR, "FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_DELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE, "ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 5/8] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 45 + 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e6bb6ce23c84..983d086fa768 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4791,6 +4791,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim, } struct reserve_ticket { + u64 orig_bytes; u64 bytes; int error; struct list_head list; @@ -5012,7 +5013,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, !test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state)); } -static void wake_all_tickets(struct list_head *head) +static bool wake_all_tickets(struct list_head *head) { struct reserve_ticket *ticket; @@ -5021,7 +5022,10 @@ static void wake_all_tickets(struct list_head *head) list_del_init(&ticket->list); ticket->error = -ENOSPC; wake_up(&ticket->wait); + if (ticket->bytes != ticket->orig_bytes) + return true; } + return false; } /* @@ -5089,8 +5093,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { - wake_all_tickets(&space_info->tickets); - space_info->flush = 0; + if (wake_all_tickets(&space_info->tickets)) { + flush_state = FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + }
else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5142,10 +5150,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(&space_info->lock); @@ -5166,14 +5175,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(&ticket->list)) list_del_init(&ticket->list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - update_bytes_may_use(space_info, -num_bytes); - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(&space_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5199,6 +5206,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5234,6 +5242,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(&ticket.wait); @@ -5274,25 +5283,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, sp
[PATCH 7/8] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers we really can only allocate chunks and flush delayed inode items, everything else is problematic. I added a bunch of new states and it led to weirdness in the FLUSH_LIMIT case because I forgot about how it worked. So instead explicitly declare the states that are ok for flushing with FLUSH_LIMIT and use that for our state machine. Then as we add new things that are safe we can just add them to this list. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 21 ++++++++++++-----------  1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0e9ba77e5316..e31980d451c2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5112,12 +5112,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work) INIT_WORK(work, btrfs_async_reclaim_metadata_space); } +static const enum btrfs_flush_state priority_flush_states[] = { + FLUSH_DELAYED_ITEMS_NR, + FLUSH_DELAYED_ITEMS, + ALLOC_CHUNK, +}; + static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, struct reserve_ticket *ticket) { u64 to_reclaim; - int flush_state = FLUSH_DELAYED_ITEMS_NR; + int flush_state = 0; spin_lock(&space_info->lock); to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info, @@ -5129,7 +5135,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(&space_info->lock); do { - flush_space(fs_info, space_info, to_reclaim, flush_state); + flush_space(fs_info, space_info, to_reclaim, + priority_flush_states[flush_state]); flush_state++; spin_lock(&space_info->lock); if (ticket->bytes == 0) { @@ -5137,15 +5144,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(&space_info->lock); return; } spin_unlock(&space_info->lock); - - /* - * Priority flushers can't wait on delalloc without - * deadlocking.
- */ - if (flush_state == FLUSH_DELALLOC || - flush_state == FLUSH_DELALLOC_WAIT) - flush_state = ALLOC_CHUNK; - } while (flush_state < COMMIT_TRANS); + } while (flush_state < ARRAY_SIZE(priority_flush_states)); } static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, -- 2.14.3
[PATCH 0/8] Enospc cleanups and fixes
The delayed refs rsv patches exposed a bunch of issues in our enospc infrastructure that needed to be addressed. These aren't really one coherent group, but they are all around flushing and reservations. may_commit_transaction() needed to be updated a little bit, and we needed to add a new state to force chunk allocation if things got dicey. Also, because we can end up needing to reserve a whole bunch of extra space for outstanding delayed refs, we needed to add the ability to fail (ENOSPC) only the tickets that were too big to satisfy, instead of failing all of the tickets. There's also a fix in here for one of the corner cases where we didn't quite have enough space reserved for the delayed refs we were generating during evict(). Thanks, Josef
[PATCH 6/8] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 56 -- 1 file changed, 41 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 983d086fa768..0e9ba77e5316 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5776,6 +5776,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root, return ret; } +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv, + u64 *metadata_bytes, u64 *qgroup_bytes) +{ + *metadata_bytes = 0; + *qgroup_bytes = 0; + + spin_lock(&block_rsv->lock); + if (block_rsv->reserved < block_rsv->size) + *metadata_bytes = block_rsv->size - block_rsv->reserved; + if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) + *qgroup_bytes = block_rsv->qgroup_rsv_size - + block_rsv->qgroup_rsv_reserved; + spin_unlock(&block_rsv->lock); +} + /** * btrfs_inode_rsv_refill - refill the inode block rsv. * @inode - the inode we are refilling.
@@ -5791,25 +5806,37 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = &inode->block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; - spin_lock(&block_rsv->lock); - if (block_rsv->reserved < block_rsv->size) - num_bytes = block_rsv->size - block_rsv->reserved; - if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) - qgroup_num_bytes = block_rsv->qgroup_rsv_size - - block_rsv->qgroup_rsv_reserved; - spin_unlock(&block_rsv->lock); - + __get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes); if (num_bytes == 0) return 0; - ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); - if (ret) - return ret; - ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + do { + ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); + if (ret) + return ret; + ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + if (ret) { + btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + last = num_bytes; + /* +* If we are fragmented we can end up with a lot of +* outstanding extents which will make our size be much +* larger than our reserved amount. If we happen to +* try to do a reservation here that may result in us +* trying to do a pretty hefty reservation, which we may +* not need once delalloc flushing happens. If this is +* the case try and do the reserve again.
+*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL) + __get_refill_bytes(block_rsv, &num_bytes, + &qgroup_num_bytes); + } + } while (ret && last != num_bytes); + if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, false); trace_btrfs_space_reservation(root->fs_info, "delalloc", @@ -5819,8 +5846,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(&block_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(&block_rsv->lock); - } else - btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + } return ret; } -- 2.14.3
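The retry loop above can be modeled outside the kernel. Below is a rough standalone sketch (not the kernel code): `refill_sim` stands in for btrfs_inode_rsv_refill, the `size -= size / 4` step is a made-up stand-in for delalloc writeout shrinking the rsv size between attempts, and the loop gives up exactly when the recomputed target stops changing, mirroring the `last != num_bytes` exit condition.

```c
/* Toy model of the btrfs_inode_rsv_refill retry loop: returns the bytes
 * reserved on success, or -1 (an ENOSPC stand-in) once the reservation
 * target stops shrinking between attempts. */
static long long refill_sim(unsigned long long size, unsigned long long avail)
{
	unsigned long long need = size, last = 0;

	if (need == 0)
		return 0;
	do {
		if (need <= avail)
			return (long long)need; /* reservation succeeded */
		last = need;
		/* pretend delalloc writeout completed and shrank the target */
		size -= size / 4;
		need = size;
	} while (last != need);
	return -1; /* target stopped changing: genuinely out of space */
}
```

The key property is termination: either the shrinking target eventually fits, or two consecutive attempts see the same target and the loop errors out instead of spinning.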
[PATCH 2/8] btrfs: dump block_rsv when dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: David Sterba --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0dca250dc02e..7a30fbc05e5e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8052,6 +8052,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +#define DUMP_BLOCK_RSV(fs_info, rsv_name) \ +do { \ + struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name; \ + spin_lock(&__rsv->lock);\ + btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu", \ + __rsv->size, __rsv->reserved); \ + spin_unlock(&__rsv->lock); \ +} while (0) + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -8071,6 +8080,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(&info->lock); + DUMP_BLOCK_RSV(fs_info, global_block_rsv); + DUMP_BLOCK_RSV(fs_info, trans_block_rsv); + DUMP_BLOCK_RSV(fs_info, chunk_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
[PATCH 8/8] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any leftover space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cae30f6c095f..3da9ac463344 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; + u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1); int failures = 0; for (;;) { struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, rsv->size, + ret = btrfs_block_rsv_refill(root, rsv, +rsv->size + delayed_refs_extra, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, return ERR_PTR(-ENOSPC); } + /* +* Evict can generate a large amount of delayed refs without +* having a way to add space back since we exhaust our temporary +* block rsv. We aren't allowed to do FLUSH_ALL in this case +* because we could deadlock with so many things in the flushing +* code, so we have to try and hold some extra space to +* compensate for our delayed ref generation. If we can't get +* that space then we need to see if we can steal our minimum from +* the global reserve. We will be ratelimited by the amount of +* space we have for the delayed refs rsv, so we'll end up +* committing and trying again.
+*/ trans = btrfs_join_transaction(root); - if (IS_ERR(trans) || !ret) + if (IS_ERR(trans) || !ret) { + if (!IS_ERR(trans)) { + trans->block_rsv = &fs_info->trans_block_rsv; + trans->bytes_reserved = delayed_refs_extra; + btrfs_block_rsv_migrate(rsv, trans->block_rsv, + delayed_refs_extra, 1); + } return trans; + } /* * Try to steal from the global reserve if there is space for -- 2.14.3
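The `delayed_refs_extra` above is one item's worth of metadata reservation. As a hedged sketch (this is my reading of the helper at the time, not a quote of the kernel source), btrfs_calc_trans_metadata_size() works out to nodesize * BTRFS_MAX_LEVEL * 2 per item: the worst case of CoWing a full root-to-leaf path, twice, for every item touched.

```c
#define BTRFS_MAX_LEVEL 8 /* btrfs tree height limit */

/* Sketch of btrfs_calc_trans_metadata_size(): assume every item touched
 * can cause a full path from root to leaf to be CoWed, twice. */
static unsigned long long calc_trans_metadata_size(unsigned long long nodesize,
						   unsigned int num_items)
{
	return nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
}
```

So on a filesystem with 16K nodes, reserving for one item asks for 256K of metadata space, which is the size of the extra cushion evict() now holds for its delayed refs.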
[PATCH 1/8] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 33 +++-- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8af68b13fa27..0dca250dc02e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4843,10 +4843,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; - /* See if there is enough pinned space to make this reservation */ - if (__percpu_counter_compare(&fs_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) + trans = btrfs_join_transaction(fs_info->extent_root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + /* +* See if there is enough pinned space to make this reservation, or if +* we have bg's that are going to be freed, allowing us to possibly do a +* chunk allocation the next loop through. +*/ + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) || + __percpu_counter_compare(&fs_info->total_bytes_pinned, bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; /* @@ -4854,7 +4862,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * this reservation.
*/ if (space_info != delayed_rsv->space_info) - return -ENOSPC; + goto enospc; spin_lock(&delayed_rsv->lock); reclaim_bytes += delayed_rsv->reserved; @@ -4868,17 +4876,14 @@ bytes -= reclaim_bytes; if (__percpu_counter_compare(&fs_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - +bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + goto enospc; commit: - trans = btrfs_join_transaction(fs_info->extent_root); - if (IS_ERR(trans)) - return -ENOSPC; - return btrfs_commit_transaction(trans); +enospc: + btrfs_end_transaction(trans); + return -ENOSPC; } /* -- 2.14.3
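The decision the patch changes can be summarized as a standalone model (an illustrative sketch with made-up names, not the kernel code): commit when pending free block groups or pinned bytes could satisfy the reservation, otherwise see whether reclaiming the delayed rsv's reserved bytes closes the gap.

```c
/* Model of the new may_commit_transaction() decision:
 * returns 1 when committing could help, 0 for an -ENOSPC bail-out. */
static int may_commit_sim(unsigned long long bytes,
			  unsigned long long pinned,
			  unsigned long long delayed_rsv_reserved,
			  int have_free_bgs,
			  int same_space_info)
{
	if (bytes == 0)
		return 1; /* nothing to reclaim, trivially fine */

	/* pending free block groups may allow a chunk allocation next loop */
	if (have_free_bgs || pinned >= bytes)
		return 1;

	/* the delayed rsv only helps if it lives in our space_info */
	if (!same_space_info)
		return 0;

	/* see if pinned bytes cover what's left after reclaiming the rsv */
	bytes = bytes > delayed_rsv_reserved ? bytes - delayed_rsv_reserved : 0;
	return pinned >= bytes;
}
```

The new `have_free_bgs` short-circuit is the whole point of the patch: previously that first branch only looked at pinned bytes.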
[PATCH 6/6] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve, simply use that instead of all of this magic. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 79 1 file changed, 17 insertions(+), 62 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8532a2eb56d1..cae30f6c095f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) return err; } -static int truncate_space_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytes_deleted) -{ - struct btrfs_fs_info *fs_info = root->fs_info; - int ret; - - /* -* This is only used to apply pressure to the enospc system, we don't -* intend to use this reservation at all. -*/ - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted); - bytes_deleted *= fs_info->nodesize; - ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv, - bytes_deleted, BTRFS_RESERVE_NO_FLUSH); - if (!ret) { - trace_btrfs_space_reservation(fs_info, "transaction", - trans->transid, - bytes_deleted, 1); - trans->bytes_reserved += bytes_deleted; - } - return ret; - -} - /* * Return this if we need to call truncate_block for the last bit of the * truncate.
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans)) should_throttle = true; } @@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +
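The throttling that replaces the removed magic can be sketched as a standalone model (illustrative only; the names and per-extent cost are made up): each deleted extent adds outstanding delayed-ref work, and once that work can no longer be covered by the reservation the loop bails out early so the caller can restart the transaction.

```c
/* Toy model of the throttled truncate loop: delete extents until either
 * we're done or the outstanding delayed-ref work exceeds what the rsv can
 * cover; returns how many extents were processed before bailing. */
static int truncate_loop_sim(int extents, unsigned long long rsv_avail,
			     unsigned long long per_extent_cost)
{
	unsigned long long delayed_refs_bytes = 0;
	int deleted = 0;

	while (deleted < extents) {
		delayed_refs_bytes += per_extent_cost; /* new delayed refs */
		deleted++;
		/* throttle: can the rsv still cover the outstanding work? */
		if (delayed_refs_bytes > rsv_avail)
			return deleted; /* short on space: bail and restart */
	}
	return deleted;
}
```

The point of the rework is that "can we cover the outstanding work" is now a direct question to the delayed refs rsv, rather than the old side-channel of reserving throwaway bytes in the transaction rsv purely to apply pressure.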
[PATCH 5/6] btrfs: introduce delayed_refs_rsv
From: Josef Bacik Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if it's a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there.
Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 26 ++-- fs/btrfs/delayed-ref.c | 28 - fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 279 +++ fs/btrfs/inode.c | 4 +- fs/btrfs/transaction.c | 77 ++-- include/trace/events/btrfs.h | 2 + 7 files changed, 313 insertions(+), 107 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8b41ec42f405..0c6d589c8ce4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -448,8 +448,9 @@ struct btrfs_space_info { #define BTRFS_BLOCK_RSV_TRANS 3 #define BTRFS_BLOCK_RSV_CHUNK 4 #define BTRFS_BLOCK_RSV_DELOPS 5 -#define BTRFS_BLOCK_RSV_EMPTY 6 -#define BTRFS_BLOCK_RSV_TEMP 7 +#define BTRFS_BLOCK_RSV_DELREFS 6 +#define BTRFS_BLOCK_RSV_EMPTY 7 +#define BTRFS_BLOCK_RSV_TEMP 8 struct btrfs_block_rsv { u64 size; @@ -812,6 +813,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2628,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info, } int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans); -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans); +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, const u64 start); void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); @@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum { enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELAYED_ITEMS = 2, - FLUSH_DELALLOC = 3, - FLUSH_DELALLOC_WAIT = 4, - ALLOC_CHUNK = 5, - COMMIT_TRANS = 6, + FLUSH_DELAYED_REFS_NR = 3, + FLUSH_DELAYED_REFS = 4, + FLUSH_DELALLOC = 5, + FLUSH_DELALLOC_WAIT = 6, + ALLOC_CHUNK = 7, + COMMIT_TRANS = 8, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); @@
-2796,6 +2801,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 48725fa757a3..96de92588f06 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -474,11
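The core bookkeeping the new rsv enables can be sketched in isolation (a toy model, not the patch itself): the rsv's size grows by a fixed per-head amount as delayed ref heads are created, and shrinks as heads are run and their reservations are released.

```c
/* Toy model of delayed_refs_rsv sizing: per_head stands in for whatever
 * btrfs_calc_trans_metadata_size(fs_info, 1) evaluates to. Returns the
 * rsv size after adding and then running the given numbers of ref heads. */
static unsigned long long delayed_refs_rsv_size_sim(unsigned long long per_head,
						    int heads_added,
						    int heads_run)
{
	unsigned long long size = 0;

	/* every new ref head reserves enough space to run it later */
	size += per_head * heads_added;

	/* running a head releases its reservation back to the space info */
	if ((unsigned long long)heads_run * per_head > size)
		return 0;
	size -= per_head * heads_run;
	return size;
}
```

Because the count of outstanding heads is always known, "do we have enough space for our delayed refs" becomes a simple comparison against this size instead of the old global-reserve heuristics.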
[PATCH 0/6] Delayed refs rsv
This patchset changes how we do space reservations for delayed refs. We were hitting probably 20-40 enospc aborts per day in production while running delayed refs at transaction commit time. This means we ran out of space in the global reserve and couldn't easily get more space in use_block_rsv(). The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictable. Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more we can use the global reserve for that spill over, which fits better into its desired use than a full blown reservation. This first approach is to simply take the number of things we're reserving space for and multiply that by 2 in order to save enough space for the delayed refs that could be generated. This is a naive approach and will probably evolve, but for now it works. With this patchset we've gone down to 2-8 failures per week. It's not perfect, there are some corner cases that still need to be addressed, but it is significantly better than what we had. Thanks, Josef
[PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks every time we need a new delayed ref head, so we always run at least one ref head completely no matter what the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index b3e4c9fcb664..48725fa757a3 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, ref->in_tree = 0; btrfs_put_delayed_ref(ref); atomic_dec(&delayed_refs->num_entries); - if (trans->delayed_ref_updates) - trans->delayed_ref_updates--; } static bool merge_ref(struct btrfs_trans_handle *trans, @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, if (ref->action == BTRFS_ADD_DELAYED_REF) list_add_tail(&ref->add_list, &href->ref_add_list); atomic_inc(&root->num_entries); - trans->delayed_ref_updates++; spin_unlock(&href->lock); return ret; } -- 2.14.3
[PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c36b3a42f2bb..e3ed3507018d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + &trans->transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(&space_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(&delayed_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(&delayed_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - trace_run_delayed_ref_head(fs_info, head, 0); - - if (head->total_ref_mod < 0) {
- struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(&space_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(&delayed_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(&delayed_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(&head->mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 3/6] btrfs: cleanup extent_op handling
From: Josef Bacik The cleanup_extent_op function actually would run the extent_op if it needed running, which made the name sort of a misnomer. Change it to run_and_cleanup_extent_op, and move the actual cleanup work to cleanup_extent_op so it can be used by check_ref_cleanup() in order to unify the extent op handling. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e3ed3507018d..8a776dc9cb38 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref btrfs_delayed_ref_unlock(head); } -static int cleanup_extent_op(struct btrfs_trans_handle *trans, -struct btrfs_delayed_ref_head *head) +static struct btrfs_delayed_extent_op * +cleanup_extent_op(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) { struct btrfs_delayed_extent_op *extent_op = head->extent_op; - int ret; if (!extent_op) - return 0; - head->extent_op = NULL; + return NULL; + if (head->must_insert_reserved) { + head->extent_op = NULL; btrfs_free_delayed_extent_op(extent_op); - return 0; + return NULL; } + return extent_op; +} + +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, +struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_extent_op *extent_op = + cleanup_extent_op(trans, head); + int ret; + + if (!extent_op) + return 0; + head->extent_op = NULL; spin_unlock(&head->lock); ret = run_delayed_extent_op(trans, head, extent_op); btrfs_free_delayed_extent_op(extent_op); @@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs = &trans->transaction->delayed_refs; - ret = cleanup_extent_op(trans, head); + ret = run_and_cleanup_extent_op(trans, head); if (ret < 0) { unselect_delayed_ref_head(delayed_refs, head); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); @@ -6977,12
+6991,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root)) goto out; - if (head->extent_op) { - if (!head->must_insert_reserved) - goto out; - btrfs_free_delayed_extent_op(head->extent_op); - head->extent_op = NULL; - } + if (cleanup_extent_op(trans, head) != NULL) + goto out; /* * waiting for the lock here would deadlock. If someone else has it -- 2.14.3
[PATCH 1/6] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik We do this dance in cleanup_ref_head and check_ref_cleanup, unify it into a helper and clean up the calling functions. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/delayed-ref.c | 14 ++ fs/btrfs/delayed-ref.h | 3 ++- fs/btrfs/extent-tree.c | 22 +++--- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 9301b3ad9217..b3e4c9fcb664 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head( return head; } +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + lockdep_assert_held(&delayed_refs->lock); + lockdep_assert_held(&head->lock); + + rb_erase_cached(&head->href_node, &delayed_refs->href_root); + RB_CLEAR_NODE(&head->href_node); + atomic_dec(&delayed_refs->num_entries); + delayed_refs->num_heads--; + if (head->processing == 0) + delayed_refs->num_heads_ready--; +} + /* * Helper to insert the ref_node to the tail or merge with tail.
* diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 8e20c5cb5404..d2af974f68a1 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { mutex_unlock(&head->mutex); } - +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); struct btrfs_delayed_ref_head *btrfs_select_ref_head( struct btrfs_delayed_ref_root *delayed_refs); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d242a1174e50..c36b3a42f2bb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&delayed_refs->lock); return 1; } - delayed_refs->num_heads--; - rb_erase_cached(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - atomic_dec(&delayed_refs->num_entries); trace_run_delayed_ref_head(fs_info, head, 0); @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!mutex_trylock(&head->mutex)) goto out; - /* -* at this point we have a head with no other entries. Go -* ahead and process it. -*/ - rb_erase_cached(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); - atomic_dec(&delayed_refs->num_entries); - - /* -* we don't take a ref on the node because we're removing it from the -* tree, so we just steal the ref the tree was holding. -*/ - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; + btrfs_delete_ref_head(delayed_refs, head); head->processing = 0; + spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); -- 2.14.3
Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote: > Running btrfs/124 in a loop hung up on me sporadically with the > following call trace: > btrfs D0 5760 5324 0x > Call Trace: >? __schedule+0x243/0x800 >schedule+0x33/0x90 >btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs] >? wait_woken+0xa0/0xa0 >btrfs_wait_ordered_range+0xbb/0x100 [btrfs] >btrfs_relocate_block_group+0x1ff/0x230 [btrfs] >btrfs_relocate_chunk+0x49/0x100 [btrfs] >btrfs_balance+0xbeb/0x1740 [btrfs] >btrfs_ioctl_balance+0x2ee/0x380 [btrfs] >btrfs_ioctl+0x1691/0x3110 [btrfs] >? lockdep_hardirqs_on+0xed/0x180 >? __handle_mm_fault+0x8e7/0xfb0 >? _raw_spin_unlock+0x24/0x30 >? __handle_mm_fault+0x8e7/0xfb0 >? do_vfs_ioctl+0xa5/0x6e0 >? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] >do_vfs_ioctl+0xa5/0x6e0 >? entry_SYSCALL_64_after_hwframe+0x3e/0xbe >ksys_ioctl+0x3a/0x70 >__x64_sys_ioctl+0x16/0x20 >do_syscall_64+0x60/0x1b0 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Turns out during page writeback it's possible that the code in > writepage_delalloc can instantiate a delalloc range which doesn't > belong to the page currently being written back. This happens since > find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc > range when asked and doesn't really consider the range of the passed > page. When such a foreign range is found the code proceeds to > run_delalloc_range and calls the appropriate function to fill the > delalloc and create ordered extents. If, however, a failure occurs > while this operation is in effect then the clean up code in > btrfs_cleanup_ordered_extents will be called. This function has the > wrong assumption that the caller of run_delalloc_range always properly > cleans the first page of the range hence when it calls > __endio_write_update_ordered it explicitly omits the first page of > the delalloc range. This is wrong because this function could be > cleaning a delalloc range that doesn't belong to the current page.
> This in turn means that the page cleanup code in __extent_writepage will > not really free the initial page for the range, leaving a hanging > ordered extent with bytes_left set to 4k. This bug has been present > ever since the original introduction of the cleanup code in 524272607e88. > > Fix this by correctly checking whether the current page belongs to the > range being instantiated and if so correctly adjust the range parameters > passed for cleaning up. If it doesn't, then just clean the whole OE > range directly. > > Signed-off-by: Nikolay Borisov > Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered > extent hang") Can we just remove the endio cleanup in __extent_writepage() and make this do the proper cleanup? I'm not sure if that is feasible or not, but seems like it would make the cleanup stuff less confusing and more straightforward. If not you can add Reviewed-by: Josef Bacik Thanks, Josef
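The range adjustment the fix describes boils down to only skipping the page when it actually lies inside the ordered-extent range being cleaned. Here is a rough standalone model of that check (names and the page-size constant are made up for illustration; this is not the kernel code):

```c
#define PAGE_SIZE_SIM 4096ULL

/* Model of the cleanup-range adjustment: the caller cleans up the page at
 * page_start itself only when that page falls inside [offset, offset+bytes),
 * so in that case the helper's range is shrunk to start after the page. */
static unsigned long long cleanup_len_sim(unsigned long long offset,
					  unsigned long long bytes,
					  unsigned long long page_start)
{
	unsigned long long page_end = page_start + PAGE_SIZE_SIM - 1;

	if (page_start >= offset && page_end <= offset + bytes - 1) {
		bytes = offset + bytes - page_end - 1;
		offset = page_end + 1;
	}
	return bytes; /* bytes the cleanup helper still has to handle */
}
```

The buggy behavior was equivalent to always taking the "skip the first page" branch; when the range was foreign to the page, that left one page's worth of the ordered extent never completed.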
Re: [PATCH] Btrfs: remove no longer used io_err from btrfs_log_ctx
On Mon, Nov 12, 2018 at 10:24:30AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > The io_err field of struct btrfs_log_ctx is no longer used after the > recent simplification of the fast fsync path, where we now wait for > ordered extents to complete before logging the inode. We did this in > commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync > time") and commit a2120a473a80 ("btrfs: clean up the left over > logged_list usage") removed its last use. > > Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] Btrfs: fix rare chances for data loss when doing a fast fsync
On Mon, Nov 12, 2018 at 10:23:58AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > After the simplification of the fast fsync patch done recently by commit > b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and > commit e7175a692765 ("btrfs: remove the wait ordered logic in the > log_one_extent path"), we got a very short time window where we can get > extents logged without writeback completing first or extents logged > without logging the respective data checksums. Both issues can only happen > when doing a non-full (fast) fsync. > > As soon as we enter btrfs_sync_file() we trigger writeback, then lock the > inode and then wait for the writeback to complete before starting to log > the inode. However before we acquire the inode's lock and after we started > writeback, it's possible that more writes happened and dirtied more pages. > If that happened and those pages get writeback triggered while we are > logging the inode (for example, the VM subsystem triggering it due to > memory pressure, or another concurrent fsync), we end up seeing the > respective extent maps in the inode's list of modified extents and will > log matching file extent items without waiting for the respective > ordered extents to complete, meaning that either of the following will > happen: > > 1) We log an extent after its writeback finishes but before its checksums >are added to the csum tree, leading to -EIO errors when attempting to >read the extent after a log replay. > > 2) We log an extent before its writeback finishes. >Therefore after the log replay we will have a file extent item pointing >to an unwritten extent (and without the respective data checksums as >well). > > This could not happen before the fast fsync patch simplification, because > for any extent we found in the list of modified extents, we would wait for > its respective ordered extent to finish writeback or collect its checksums > for logging if it did not complete yet. 
> > Fix this by triggering writeback again after acquiring the inode's lock > and before waiting for ordered extents to complete. > > Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the > log_one_extent path") > Fixes: b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") > CC: sta...@vger.kernel.org # 4.19+ > Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
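The ordering fix described above can be sketched as a toy userspace model — this is not btrfs code; the page model and every name here are invented. A page dirtied in the window between the first writeback pass and taking the inode lock is only flushed (and therefore safe to log) if writeback is triggered again under the lock:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a "page" is dirty or clean; logging a dirty page stands in
 * for logging an extent whose ordered extent (and csums) are not done. */
enum { NPAGES = 4 };
static bool dirty[NPAGES];

static void flush(void)              /* models: start + wait for writeback */
{
    for (int i = 0; i < NPAGES; i++)
        dirty[i] = false;
}

/* Returns the number of extents logged without checksums (want 0). */
static int fsync_model(bool flush_again_under_lock)
{
    dirty[0] = true;                 /* write before fsync starts        */
    flush();                         /* 1) trigger + wait for writeback  */
    dirty[1] = true;                 /* 2) concurrent write in the window */
    /* inode lock acquired here */
    if (flush_again_under_lock)
        flush();                     /* the fix: writeback again, locked */
    int bad = 0;                     /* 3) log modified extents          */
    for (int i = 0; i < NPAGES; i++)
        if (dirty[i])
            bad++;                   /* dirty at log time => missing csums */
    return bad;
}
```

Without the second flush the model logs one extent whose data is still in flight; with it, none.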
Re: [Regression bisected] btrfs: always wait on ordered extents at fsync time
On Fri, Nov 09, 2018 at 10:56:42PM +, Mel Gorman wrote: > On Fri, Nov 09, 2018 at 09:51:48PM +, Mel Gorman wrote: > > Unfortunately, as > > I'm about to travel, I didn't attempt a revert and a test comparing 4.18, > > 4.19 and 4.20-rc1 is a few hours away so this could potentially be fixed > > already but I didn't spot any obvious Fixes commit. > > > > Still here a few hours later but the regression still appears to exist > in mainline and the comparison report is below. While there are slight > differences, the regressions are well outside multiples of the stddev and > co-efficient of variance so I'm fairly sure it's real. The one positive > thing is that the actual standard deviation is lower so the results are > more stable but that is a thin silver lining. > Yeah unfortunately this was expected, though in my tests I didn't see as harsh of a regression. I'll start working on making the regression suck less, thanks for running this down Mel, Josef
Re: [PATCH v15.1 04/13] btrfs: dedupe: Introduce function to remove hash from in-memory tree
On Tue, Nov 06, 2018 at 02:41:13PM +0800, Lu Fengqi wrote: > From: Wang Xiaoguang > > Introduce static function inmem_del() to remove hash from in-memory > dedupe tree. > And implement btrfs_dedupe_del() and btrfs_dedupe_disable() interfaces. > > Also for btrfs_dedupe_disable(), add new functions to wait for existing > writers and block incoming writers to eliminate all possible races. > > Cc: Mark Fasheh > Signed-off-by: Qu Wenruo > Signed-off-by: Wang Xiaoguang > Signed-off-by: Lu Fengqi > --- > fs/btrfs/dedupe.c | 131 +++--- > 1 file changed, 125 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c > index 784bb3a8a5ab..951fefd19fde 100644 > --- a/fs/btrfs/dedupe.c > +++ b/fs/btrfs/dedupe.c > @@ -170,12 +170,6 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, > return ret; > } > > -int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) > -{ > - /* Place holder for bisect, will be implemented in later patches */ > - return 0; > -} > - > static int inmem_insert_hash(struct rb_root *root, >struct inmem_hash *hash, int hash_len) > { > @@ -317,3 +311,128 @@ int btrfs_dedupe_add(struct btrfs_fs_info *fs_info, > return inmem_add(dedupe_info, hash); > return -EINVAL; > } > + > +static struct inmem_hash * > +inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr) > +{ > + struct rb_node **p = &dedupe_info->bytenr_root.rb_node; > + struct rb_node *parent = NULL; > + struct inmem_hash *entry = NULL; > + > + while (*p) { > + parent = *p; > + entry = rb_entry(parent, struct inmem_hash, bytenr_node); > + > + if (bytenr < entry->bytenr) > + p = &(*p)->rb_left; > + else if (bytenr > entry->bytenr) > + p = &(*p)->rb_right; > + else > + return entry; > + } > + > + return NULL; > +} > + > +/* Delete a hash from in-memory dedupe tree */ > +static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr) > +{ > + struct inmem_hash *hash; > + > + mutex_lock(&dedupe_info->lock); > + hash = inmem_search_bytenr(dedupe_info, bytenr); > + if
(!hash) { > + mutex_unlock(&dedupe_info->lock); > + return 0; > + } > + > + __inmem_del(dedupe_info, hash); > + mutex_unlock(&dedupe_info->lock); > + return 0; > +} > + > +/* Remove a dedupe hash from dedupe tree */ > +int btrfs_dedupe_del(struct btrfs_fs_info *fs_info, u64 bytenr) > +{ > + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; > + > + if (!fs_info->dedupe_enabled) > + return 0; > + > + if (WARN_ON(dedupe_info == NULL)) > + return -EINVAL; > + > + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) > + return inmem_del(dedupe_info, bytenr); > + return -EINVAL; > +} > + > +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info) > +{ > + struct inmem_hash *entry, *tmp; > + > + mutex_lock(&dedupe_info->lock); > + list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list) > + __inmem_del(dedupe_info, entry); > + mutex_unlock(&dedupe_info->lock); > +} > + > +/* > + * Helper function to wait and block all incoming writers > + * > + * Use rw_sem introduced for freeze to wait/block writers. > + * So during the block time, no new write will happen, so we can > + * do something quite safe, especially helpful for dedupe disable, > + * as it affects buffered writes. > + */ > +static void block_all_writers(struct btrfs_fs_info *fs_info) > +{ > + struct super_block *sb = fs_info->sb; > + > + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); > + down_write(&sb->s_umount); > +} > + > +static void unblock_all_writers(struct btrfs_fs_info *fs_info) > +{ > + struct super_block *sb = fs_info->sb; > + > + up_write(&sb->s_umount); > + percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1); > +} Please use the sb_ helpers, don't open code this.
> + > +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) > +{ > + struct btrfs_dedupe_info *dedupe_info; > + int ret; > + > + dedupe_info = fs_info->dedupe_info; > + > + if (!dedupe_info) > + return 0; > + > + /* Don't allow disable status change in RO mount */ > + if (fs_info->sb->s_flags & MS_RDONLY) > + return -EROFS; > + > + /* > + * Wait for all unfinished writers and block further writers. > + * Then sync the whole fs so all current write will go through > + * dedupe, and all later write won't go through dedupe. > + */ > + block_all_writers(fs_info); > + ret = sync_filesystem(fs_info->sb); > + fs_info->dedupe_enabled = 0; > + fs_info->dedupe_info = NULL; > + unblock_all_writers(fs_info); This is awful, don't do this. Thanks, Josef
Re: [PATCH v15.1 01/13] btrfs: dedupe: Introduce dedupe framework and its header
On Tue, Nov 06, 2018 at 02:41:10PM +0800, Lu Fengqi wrote: > From: Wang Xiaoguang > > Introduce the header for btrfs in-band(write time) de-duplication > framework and needed header. > > The new de-duplication framework is going to support 2 different dedupe > methods and 1 dedupe hash. > > Signed-off-by: Qu Wenruo > Signed-off-by: Wang Xiaoguang > Signed-off-by: Lu Fengqi > --- > fs/btrfs/ctree.h | 7 ++ > fs/btrfs/dedupe.h | 128 - > fs/btrfs/disk-io.c | 1 + > include/uapi/linux/btrfs.h | 34 ++ > 4 files changed, 168 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 80953528572d..910050d904ef 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1118,6 +1118,13 @@ struct btrfs_fs_info { > spinlock_t ref_verify_lock; > struct rb_root block_tree; > #endif > + > + /* > + * Inband de-duplication related structures > + */ > + unsigned long dedupe_enabled:1; Please use a BTRFS_FS_ flag for this. Thanks, Josef
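The BTRFS_FS_ flags Josef asks for are atomic bit flags kept in fs_info rather than a bitfield. A minimal userspace sketch of the pattern — the flag name, bit index, and helpers are invented, and plain bit operations stand in for the kernel's set_bit/clear_bit/test_bit:

```c
#include <limits.h>

/* Hypothetical flag; in the kernel this would be a BTRFS_FS_* enum
 * value used with set_bit()/test_bit() on fs_info->flags. */
enum { BTRFS_FS_DEDUPE_ENABLED = 5 };   /* bit index, value invented */

struct fs_info_sketch { unsigned long flags; };

static void fs_set_flag(struct fs_info_sketch *fs, int bit)
{
    fs->flags |= 1UL << bit;
}

static void fs_clear_flag(struct fs_info_sketch *fs, int bit)
{
    fs->flags &= ~(1UL << bit);
}

static int fs_test_flag(const struct fs_info_sketch *fs, int bit)
{
    return !!(fs->flags & (1UL << bit));
}
```

Compared with a `:1` bitfield, a bit in a shared flags word can be toggled atomically and extended without changing the struct layout.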
Re: [PATCH] btrfs: Check for missing device before bio submission in btrfs_map_bio
On Thu, Nov 08, 2018 at 04:16:38PM +0200, Nikolay Borisov wrote: > Before btrfs_map_bio submits all stripe bios it does a number of checks > to ensure the device for every stripe is present. However, it doesn't > do a DEV_STATE_MISSING check, instead this is relegated to the lower > level btrfs_schedule_bio (in the async submission case, sync submission > doesn't check DEV_STATE_MISSING at all). Additionally > btrfs_schedule_bios does the duplicate device->bdev check which has > already been performed in btrfs_map_bio. > > This patch moves the DEV_STATE_MISSING check into btrfs_map_bio and > removes the duplicate device->bdev check. Doing so ensures that no bio > cloning/submission happens for both async/sync requests in the face of > a missing device. This makes the async io submission path slightly shorter > in terms of instruction count. No functional changes. > > Signed-off-by: Nikolay Borisov > --- Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] Btrfs: simpler and more efficient cleanup of a log tree's extent io tree
On Fri, Nov 09, 2018 at 10:43:08AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > We currently are in a loop finding each range (corresponding to a btree > node/leaf) in a log root's extent io tree and then cleaning it up. This is a > waste of time since we are traversing the extent io tree's rb_tree more > times than needed (one for a range lookup and another for cleaning it up) > without any good reason. > > We free the log trees when we are in the critical section of a transaction > commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's > of great convenience to do everything as fast as possible in order to > reduce the time we block other tasks from starting a new transaction. > > So fix this by traversing the extent io tree once and cleaning up all its > records in one go while traversing it. > > Signed-off-by: Filipe Manana Sheesh Reviewed-by: Josef Bacik Thanks, Josef
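The shape of the fix — one destructive traversal instead of a lookup plus a removal per record — can be sketched with a singly linked list standing in for the extent io tree's rb-tree (all names here are invented, not the btrfs structures):

```c
#include <stdlib.h>

/* A record [start, end], standing in for one extent io tree entry. */
struct range {
    unsigned long start, end;
    struct range *next;
};

static struct range *push(struct range *head, unsigned long s, unsigned long e)
{
    struct range *r = malloc(sizeof(*r));
    r->start = s;
    r->end = e;
    r->next = head;
    return r;
}

/* Frees every record in a single pass over the structure; returns how
 * many were freed.  The old pattern did one search per record followed
 * by one removal, touching the tree twice for each entry. */
static int clean_all(struct range **head)
{
    int n = 0;
    while (*head) {
        struct range *r = *head;
        *head = r->next;
        free(r);
        n++;
    }
    return n;
}
```

For n records this is one O(n) walk rather than n separate O(log n) lookups plus removals, which matters when it runs inside the transaction commit critical section.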
Re: [PATCH] Btrfs: fix data corruption due to cloning of eof block
On Mon, Nov 05, 2018 at 11:14:17AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > We currently allow cloning a range from a file which includes the last > block of the file even if the file's size is not aligned to the block > size. This is fine and useful when the destination file has the same size, > but when it does not and the range ends somewhere in the middle of the > destination file, it leads to corruption because the bytes between the EOF > and the end of the block have undefined data (when there is support for > discard/trimming they have a value of 0x00). > > Example: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > > $ export foo_size=$((256 * 1024 + 100)) > $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo > $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar > > $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar > > $ od -A d -t x1 /mnt/bar > 000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 > * > 0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c > * > 0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00 > 0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 > * > 1048576 > > The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527 > (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead > of 0xb5. > > This is similar to the problem we had for deduplication that got recently > fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when > deduplicating between different files"). > > Fix this by not allowing such operations to be performed and return the > errno -EINVAL to user space. This is what XFS is doing as well at the VFS > level. 
This change however now makes us return -EINVAL instead of > -EOPNOTSUPP for cases where the source range maps to an inline extent and > the destination range's end is smaller than the destination file's size, > since the detection of inline extents is done during the actual process of > dropping file extent items (at __btrfs_drop_extents()). Returning the > -EINVAL error is done early on and solely based on the input parameters > (offsets and length) and destination file's size. This makes us consistent > with XFS and anyone else supporting cloning since this case is now checked > at a higher level in the VFS and is where the -EINVAL will be returned > from starting with kernel 4.20 (the VFS change was introduced in 4.20-rc1 > by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into > partial EOF block")). So this change is more geared towards stable kernels, > as it's unlikely the new VFS checks get removed intentionally. > > A test case for fstests follows soon, as well as an update to filter > existing tests that expect -EOPNOTSUPP to accept -EINVAL as well. > > CC: # 4.4+ > Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
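The rejection rule described in the commit message can be distilled into a predicate. This is a simplified sketch based only on the description above (the real check lives in the clone ioctl path and handles further cases); using the reproducer's numbers, the bad clone is rejected:

```c
/* Local stand-in for the kernel's -EINVAL. */
#define EINVAL_SKETCH 22

/* Reject a clone whose source range includes the unaligned EOF block of
 * the source file while the destination range ends before the
 * destination file's EOF (simplified from the commit message). */
static int clone_range_ok(unsigned long long src_size,
                          unsigned long long dest_size,
                          unsigned long long off, unsigned long long len,
                          unsigned long long destoff, unsigned int blocksize)
{
    int src_unaligned_eof = (off + len == src_size) &&
                            (src_size % blocksize != 0);
    int dest_ends_mid_file = (destoff + len < dest_size);

    if (src_unaligned_eof && dest_ends_mid_file)
        return -EINVAL_SKETCH;
    return 0;
}
```

In the example, foo is 256K+100 bytes (unaligned EOF) cloned at offset 512K into a 1M file, so the range ends mid-file and the clone must be refused.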
Re: [PATCH 8/8] btrfs: Remove extent_io_ops::split_extent_hook callback
On Thu, Nov 01, 2018 at 02:09:53PM +0200, Nikolay Borisov wrote: > This is the counterpart to merge_extent_hook, similarly, it's used only > for data/freespace inodes so let's remove it, rename it and call it > directly where necessary. No functional changes. > > Signed-off-by: Nikolay Borisov > --- Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 7/8] btrfs: Remove extent_io_ops::merge_extent_hook callback
On Thu, Nov 01, 2018 at 02:09:52PM +0200, Nikolay Borisov wrote: > This callback is used only for data and free space inodes. Such inodes > are guaranteed to have their extent_io_tree::private_data set to the > inode struct. Exploit this fact to directly call the function. Also > give it a more descriptive name. No functional changes. > > Signed-off-by: Nikolay Borisov > --- Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 6/8] btrfs: Remove extent_io_ops::clear_bit_hook callback
On Thu, Nov 01, 2018 at 02:09:51PM +0200, Nikolay Borisov wrote: > This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent), > similar to what was done before remove clear_bit_hook and rename the > function. No functional changes. > > Signed-off-by: Nikolay Borisov Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 5/8] btrfs: Remove extent_io_ops::set_bit_hook extent_io callback
On Thu, Nov 01, 2018 at 02:09:50PM +0200, Nikolay Borisov wrote: > This callback is used to properly account delalloc extents for > data inodes (ordinary file inodes and freespace v1 inodes). Those can > be easily identified since they have their extent_io trees > ->private_data member point to the inode. Let's exploit this fact to > remove the needless indirection through extent_io_hooks and directly > call the function. Also give the function a name which reflects its > purpose - btrfs_set_delalloc_extent. > > This patch also modified test_find_delalloc so that the extent_io_tree > used for testing doesn't have its ->private_data set which would have > caused a crash in btrfs_set_delalloc_extent due to the > btrfs_inode->root member not being initialised. The old version of the > code also didn't call set_bit_hook since the extent_io ops weren't set > for the inode. No functional changes. > > Signed-off-by: Nikolay Borisov Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 4/8] btrfs: Remove extent_io_ops::check_extent_io_range callback
On Thu, Nov 01, 2018 at 02:09:49PM +0200, Nikolay Borisov wrote: > This callback was only used in debug builds by btrfs_leak_debug_check. > A better approach is to move its implementation in > btrfs_leak_debug_check and ensure the latter is only executed for > extent tree which have ->private_data set i.e. relate to a data node and > not the btree one. No functional changes. > > Signed-off-by: Nikolay Borisov Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 3/8] btrfs: Remove extent_io_ops::writepage_end_io_hook
On Thu, Nov 01, 2018 at 02:09:48PM +0200, Nikolay Borisov wrote: > This callback is only ever called for data page writeout so > there is no need to actually abstract it via extent_io_ops. Let's just > export it, remove the definition of the callback and call it directly > in the functions that invoke the callback. Also rename the function to > btrfs_writepage_endio_finish_ordered since what it really does is > account finished io in the ordered extent data structures. > No functional changes. > > Signed-off-by: Nikolay Borisov You could send another cleanup patch to remove the struct extent_state *state from the arg list as well. Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 2/8] btrfs: Remove extent_io_ops::writepage_start_hook
On Thu, Nov 01, 2018 at 02:09:47PM +0200, Nikolay Borisov wrote: > This hook is called only from __extent_writepage_io which is already > called only from the data page writeout path. So there is no need to > make an indirect call via extent_io_ops. This patch just removes the > callback definition, exports the callback function and calls it > directly at the only call site. Also give the function a more descriptive > name. No functional changes. > > Signed-off-by: Nikolay Borisov Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 1/8] btrfs: Remove extent_io_ops::fill_delalloc
On Thu, Nov 01, 2018 at 02:09:46PM +0200, Nikolay Borisov wrote: > This callback is called only from writepage_delalloc which in turn > is guaranteed to be called from the data page writeout path. In the end > there is no reason to have the call to this function be indirected > via the extent_io_ops structure. This patch removes the callback > definition, exports the function and calls it directly. No functional > changes. > > Signed-off-by: Nikolay Borisov Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH] fstests: fix fssum to actually ignore file holes when supposed to
On Mon, Oct 29, 2018 at 09:43:40AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > Unless the '-s' option is passed to fssum, it should not detect file holes > and have their existence influence the computed checksum for a file. This > tool was added to test btrfs' send/receive feature, so that it checks for > any metadata and data differences between the original filesystem and the > filesystem that receives send streams. > > For a long time the test btrfs/007, which tests btrfs' send/receive with > fsstress, fails sporadically reporting data differences between files. > However the md5sum/sha1sum from the reported files in the original and > new filesystems are the same. The reason why fssum fails is because even > in normal mode it still accounts for the number of holes that exist in the > file and their respective lengths. This is done using the SEEK_DATA mode > of lseek. The btrfs send feature does not preserve holes nor prealloc > extents (not supported by the current protocol), so whenever a hole or > prealloc (unwritten) extent is detected in the source filesystem, it > issues a write command full of zeroes, which will translate to a regular > (written) extent in the destination filesystem. This is why fssum reports > a different checksum. A prealloc extent also counts as a hole when using > lseek. > > For example when passing a seed of 1540592967 to fsstress in btrfs/007, > the test fails, as file p0/d0/f7 has a prealloc extent in the original > filesystem (in the incr snapshot). > > Fix this by making fssum just read the whole file and feed its data to the > digest calculation function when option '-s' is not given. If we ever get > btrfs' send/receive to support holes and fallocate, we can just change > the test and pass the '-s' option to all fssum calls. > > Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
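The fix leans on the fact that sequentially reading a sparse file returns zeros for holes, so a digest computed over plain reads is identical for a sparse file and a fully written equivalent, regardless of hole layout. A small Linux demonstration — a trivial additive checksum stands in for fssum's real digest, and the temp-file paths are arbitrary:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Checksum over whatever read() returns: holes read back as zeros, so
 * they contribute nothing, exactly like explicitly written zeros. */
static unsigned long sum_fd(int fd)
{
    unsigned char buf[4096];
    unsigned long sum = 0;
    ssize_t n;

    lseek(fd, 0, SEEK_SET);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    return sum;
}
```

A sparse file with an 8K hole followed by "abc" and a dense file with 8K of explicit zeros followed by "abc" produce the same checksum, which is the behavior fssum should have without '-s'.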
Re: [PATCH] Btrfs: fix missing data checksums after a ranged fsync (msync)
On Mon, Oct 29, 2018 at 09:42:06AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > Recently we got a massive simplification for fsync, where for the fast > path we no longer log new extents while their respective ordered extents > are still running. > > However that simplification introduced a subtle regression for the case > where we use a ranged fsync (msync). Consider the following example: > >CPU 0CPU 1 > > mmap write to range [2Mb, 4Mb[ > mmap write to range [512Kb, 1Mb[ > msync range [512K, 1Mb[ > --> triggers fast fsync > (BTRFS_INODE_NEEDS_FULL_SYNC > not set) > --> creates extent map A for this > range and adds it to list of > modified extents > --> starts ordered extent A for > this range > --> waits for it to complete > > writeback triggered for range > [2Mb, 4Mb[ > --> create extent map B and > adds it to the list of > modified extents > --> creates ordered extent B > > --> start looking for and logging > modified extents > --> logs extent maps A and B > --> finds checksums for extent A > in the csum tree, but not for > extent B > fsync (msync) finishes > > --> ordered extent B > finishes and its > checksums are added > to the csum tree > > > > After replaying the log, we have the extent covering the range [2Mb, 4Mb[ > but do not have the data checksum items covering that file range. > > This happens because at the very beginning of an fsync (btrfs_sync_file()) > we start and wait for IO in the given range [512Kb, 1Mb[ and therefore > wait for any ordered extents in that range to complete before we start > logging the extents. 
However if right before we start logging the extent > in our range [512Kb, 1Mb[, writeback is started for any other dirty range, > such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync > or msync (btrfs_sync_file() starts writeback before acquiring the inode's > lock), an ordered extent is created for that other range and a new extent > map is created to represent that range and added to the inode's list of > modified extents. > > That means that we will see that other extent in that list when collecting > extents for logging (done at btrfs_log_changed_extents()) and log the > extent before the respective ordered extent finishes - namely before the > checksum items are added to the checksums tree, which is where > log_extent_csums() looks for the checksums, therefore making us log an > extent without logging its checksums. Before that massive simplification > of fsync, this wasn't a problem because besides looking for checksums in > the checksums tree, we also looked for them in any ordered extent still > running. > > The consequence of data checksums missing for a file range is that users > attempting to read the affected file range will get -EIO errors and dmesg > reports the following: > > [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start > 57344 > [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off > 57344 csum 0x98f94189 expected csum 0x mirror 1 > > So fix this by skipping any extents outside of our logging range at > btrfs_log_changed_extents() and leaving them on the list of modified > extents so that any subsequent ranged fsync may collect them if needed. > Also, if we find a hole extent outside of the range still log it, just > to prevent having gaps between extent items after replaying the log, > otherwise fsck will complain when we are not using the NO_HOLES feature > (fstest btrfs/056 triggers such a case).
> > Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the > log_one_extent path") > CC: sta...@vger.kernel.org # 4.19+ > Signed-off-by: Filipe Manana Nice catch, Reviewed-by: Josef Bacik Josef
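The logging decision described in the fix can be sketched as a predicate — a simplification with invented names (the real code works on extent maps inside btrfs_log_changed_extents()): log anything overlapping the fsync range, plus holes even outside it, and leave the rest on the modified list for a later fsync:

```c
/* An extent in the inode's modified list, [start, end] inclusive. */
struct em_sketch {
    unsigned long long start, end;
    int is_hole;
};

/* Should a ranged fsync of [lstart, lend] log this extent?  Extents
 * outside the range are skipped (their ordered extents may still be
 * running, so their csums may not exist yet) -- except holes, which are
 * logged anyway to avoid gaps between items after log replay when the
 * NO_HOLES feature is not in use. */
static int should_log(const struct em_sketch *em,
                      unsigned long long lstart, unsigned long long lend)
{
    int outside = em->end < lstart || em->start > lend;

    if (outside)
        return em->is_hole;
    return 1;
}
```

With the example above, a [2Mb, 4Mb[ extent is skipped by an fsync of [512Kb, 1Mb[ and stays on the list until its own ordered extent completes.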
Re: [PATCH 0/3] fix pinned underflow in generic/475
On Wed, Oct 24, 2018 at 08:24:00PM +0800, Lu Fengqi wrote: > When running generic/475, pinned underflow may occur. This patch will > fix this problem, but there are still other warnings that need to be addressed in > this case. > > Patches 1-2 introduce a macro and wrappers to help detect underflow > Patch 3 is the fix for the pinned underflow > > Lu Fengqi (2): > btrfs: extent-tree: Detect bytes_pinned underflow earlier > btrfs: fix pinned underflow after transaction aborted > > Qu Wenruo (1): > btrfs: extent-tree: Detect bytes_may_use underflow earlier > You can add Reviewed-by: Josef Bacik to the series, thanks, Josef
Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent
On Wed, Oct 24, 2018 at 01:48:40PM +0100, Filipe Manana wrote: > On Wed, Oct 24, 2018 at 1:40 PM Josef Bacik wrote: > > > > On Wed, Oct 24, 2018 at 12:53:59PM +0100, Filipe Manana wrote: > > > On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik wrote: > > > > > > > > On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote: > > > > > From: Filipe Manana > > > > > > > > > > When we are writing out a free space cache, during the transaction > > > > > commit > > > > > phase, we can end up in a deadlock which results in a stack trace > > > > > like the > > > > > following: > > > > > > > > > > schedule+0x28/0x80 > > > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > > > ? finish_wait+0x80/0x80 > > > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > > > ? inode_insert5+0x119/0x190 > > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > > btrfs_iget+0x113/0x690 [btrfs] > > > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > > > ? cache_block_group+0x72/0x3b0 [btrfs] > > > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > > > ? finish_wait+0x80/0x80 > > > > > find_free_extent+0x799/0x1010 [btrfs] > > > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > > ? 
kmem_cache_alloc+0x166/0x1d0 > > > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > > > > > > > > At cache_save_setup() we need to update the inode item of a block > > > > > group's > > > > > cache which is located in the tree root (fs_info->tree_root), which > > > > > means > > > > > that it may result in COWing a leaf from that tree. If that happens we > > > > > need to find a free metadata extent and while looking for one, if we > > > > > find > > > > > a block group which was not cached yet we attempt to load its cache by > > > > > calling cache_block_group(). However this function will try to load > > > > > the > > > > > inode of the free space cache, which requires finding the matching > > > > > inode > > > > > item in the tree root - if that inode item is located in the same > > > > > leaf as > > > > > the inode item of the space cache we are updating at > > > > > cache_save_setup(), > > > > > we end up in a deadlock, since we try to obtain a read lock on the > > > > > same > > > > > extent buffer that we previously write locked. > > > > > > > > > > So fix this by using the tree root's commit root when searching for a > > > > > block group's free space cache inode item when we are attempting to > > > > > load > > > > > a free space cache. This is safe since block groups once loaded stay > > > > > in > > > > > memory forever, as well as their caches, so after they are first > > > > > loaded > > > > > we will never need to read their inode items again. For new block > > > > > groups, > > > > > once they are created they get their ->cached field set to > > > > > BTRFS_CACHE_FINISHED meaning we will not need to read their inode > > > > > item. 
> > > > > > > > > > Reported-by: Andrew Nelson > > > > > Link: > > > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > > > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > > >
Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent
On Wed, Oct 24, 2018 at 12:53:59PM +0100, Filipe Manana wrote: > On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik wrote: > > > > On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote: > > > From: Filipe Manana > > > > > > When we are writing out a free space cache, during the transaction commit > > > phase, we can end up in a deadlock which results in a stack trace like the > > > following: > > > > > > schedule+0x28/0x80 > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > ? finish_wait+0x80/0x80 > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > ? inode_insert5+0x119/0x190 > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_iget+0x113/0x690 [btrfs] > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > ? cache_block_group+0x72/0x3b0 [btrfs] > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > ? finish_wait+0x80/0x80 > > > find_free_extent+0x799/0x1010 [btrfs] > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > > > > At cache_save_setup() we need to update the inode item of a block group's > > > cache which is located in the tree root (fs_info->tree_root), which means > > > that it may result in COWing a leaf from that tree. 
If that happens we > > > need to find a free metadata extent and while looking for one, if we find > > > a block group which was not cached yet we attempt to load its cache by > > > calling cache_block_group(). However this function will try to load the > > > inode of the free space cache, which requires finding the matching inode > > > item in the tree root - if that inode item is located in the same leaf as > > > the inode item of the space cache we are updating at cache_save_setup(), > > > we end up in a deadlock, since we try to obtain a read lock on the same > > > extent buffer that we previously write locked. > > > > > > So fix this by using the tree root's commit root when searching for a > > > block group's free space cache inode item when we are attempting to load > > > a free space cache. This is safe since block groups once loaded stay in > > > memory forever, as well as their caches, so after they are first loaded > > > we will never need to read their inode items again. For new block groups, > > > once they are created they get their ->cached field set to > > > BTRFS_CACHE_FINISHED meaning we will not need to read their inode item. > > > > > > Reported-by: Andrew Nelson > > > Link: > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > > > Tested-by: Andrew Nelson > > > Signed-off-by: Filipe Manana > > > --- > > > > > > > Now my goal is to see how many times I can get you to redo this thing. > > > > Why not instead just do > > > > if (btrfs_is_free_space_inode(inode)) > > path->search_commit_root = 1; > > > > in read_locked_inode? That would be cleaner. If we don't want to do that > > for > > the inode cache (I'm not sure if it's ok or not) we could just do > > > > if (root == fs_info->tree_root) > > We can't (not just that at least). 
> Tried something like that, but we get into a BUG_ON when writing out > the space cache for new block groups (created in the current > transaction). > Because at cache_save_setup() we have this: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c?h=v4.19#n3342 > > Lookup for the inode in normal root, doesn't exist, create it then > repeat - if still not found, BUG_ON. > Could also make create_free_space_inode() return an inode pointer and > make it call btrfs_iget(). > Ah ok makes sense. Well in that case lets just make btrfs_read_locked_inode() take a path, and allocate it in btrfs_iget, that'll remove the ugly if (path != in_path) stuff. Thanks, Josef
Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent
On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > When we are writing out a free space cache, during the transaction commit > phase, we can end up in a deadlock which results in a stack trace like the > following: > > schedule+0x28/0x80 > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > ? finish_wait+0x80/0x80 > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > btrfs_search_slot+0xf6/0x9f0 [btrfs] > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > ? inode_insert5+0x119/0x190 > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_iget+0x113/0x690 [btrfs] > __lookup_free_space_inode+0xd8/0x150 [btrfs] > lookup_free_space_inode+0x5b/0xb0 [btrfs] > load_free_space_cache+0x7c/0x170 [btrfs] > ? cache_block_group+0x72/0x3b0 [btrfs] > cache_block_group+0x1b3/0x3b0 [btrfs] > ? finish_wait+0x80/0x80 > find_free_extent+0x799/0x1010 [btrfs] > btrfs_reserve_extent+0x9b/0x180 [btrfs] > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > __btrfs_cow_block+0x11d/0x500 [btrfs] > btrfs_cow_block+0xdc/0x180 [btrfs] > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_update_inode_item+0x46/0x100 [btrfs] > cache_save_setup+0xe4/0x3a0 [btrfs] > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > At cache_save_setup() we need to update the inode item of a block group's > cache which is located in the tree root (fs_info->tree_root), which means > that it may result in COWing a leaf from that tree. If that happens we > need to find a free metadata extent and while looking for one, if we find > a block group which was not cached yet we attempt to load its cache by > calling cache_block_group(). 
However this function will try to load the > inode of the free space cache, which requires finding the matching inode > item in the tree root - if that inode item is located in the same leaf as > the inode item of the space cache we are updating at cache_save_setup(), > we end up in a deadlock, since we try to obtain a read lock on the same > extent buffer that we previously write locked. > > So fix this by using the tree root's commit root when searching for a > block group's free space cache inode item when we are attempting to load > a free space cache. This is safe since block groups once loaded stay in > memory forever, as well as their caches, so after they are first loaded > we will never need to read their inode items again. For new block groups, > once they are created they get their ->cached field set to > BTRFS_CACHE_FINISHED meaning we will not need to read their inode item. > > Reported-by: Andrew Nelson > Link: > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > Tested-by: Andrew Nelson > Signed-off-by: Filipe Manana > --- > Now my goal is to see how many times I can get you to redo this thing. Why not instead just do if (btrfs_is_free_space_inode(inode)) path->search_commit_root = 1; in read_locked_inode? That would be cleaner. If we don't want to do that for the inode cache (I'm not sure if it's ok or not) we could just do if (root == fs_info->tree_root) instead. Thanks, Josef
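The commit-root idea above can be illustrated with a toy model (not btrfs code; all names are made up): the live tree root is guarded by a lock the committing thread already write-holds, while the commit root is an immutable snapshot of the last committed transaction and can be searched without taking the live lock at all:

```python
# Toy model (not btrfs code) of why search_commit_root avoids the deadlock:
# reading the immutable committed snapshot never needs the leaf lock that
# the committing thread is already write-holding.

live_root = {"fsc_inode_item": "live"}          # protected by the leaf lock
commit_root = {"fsc_inode_item": "committed"}   # immutable snapshot, lock-free

write_locked_by_us = True  # we are mid cache_save_setup(), leaf write-locked

def search_slot(key, search_commit_root=False):
    if search_commit_root:
        return commit_root[key]  # never blocks, never deadlocks
    if write_locked_by_us:
        # Same thread trying to read-lock a leaf it write-locked.
        raise RuntimeError("deadlock: read lock on leaf we write-locked")
    return live_root[key]

# Loading the free space cache inode via the commit root succeeds:
item = search_slot("fsc_inode_item", search_commit_root=True)

# The same lookup against the live root would self-deadlock:
try:
    search_slot("fsc_inode_item")
    live_lookup_ok = True
except RuntimeError:
    live_lookup_ok = False
```

This also models why the approach is safe here: block groups and their cache inodes, once loaded, stay in memory, so the slightly stale committed view is only ever read once.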
Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent
On Mon, Oct 22, 2018 at 11:05:08PM +0100, Filipe Manana wrote: > On Mon, Oct 22, 2018 at 8:18 PM Josef Bacik wrote: > > > > On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote: > > > From: Filipe Manana > > > > > > When we are writing out a free space cache, during the transaction commit > > > phase, we can end up in a deadlock which results in a stack trace like the > > > following: > > > > > > schedule+0x28/0x80 > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > ? finish_wait+0x80/0x80 > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > ? inode_insert5+0x119/0x190 > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_iget+0x113/0x690 [btrfs] > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > ? cache_block_group+0x72/0x3b0 [btrfs] > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > ? finish_wait+0x80/0x80 > > > find_free_extent+0x799/0x1010 [btrfs] > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > ? kmem_cache_alloc+0x166/0x1d0 > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > > > > At cache_save_setup() we need to update the inode item of a block group's > > > cache which is located in the tree root (fs_info->tree_root), which means > > > that it may result in COWing a leaf from that tree. 
If that happens we > > > need to find a free metadata extent and while looking for one, if we find > > > a block group which was not cached yet we attempt to load its cache by > > > calling cache_block_group(). However this function will try to load the > > > inode of the free space cache, which requires finding the matching inode > > > item in the tree root - if that inode item is located in the same leaf as > > > the inode item of the space cache we are updating at cache_save_setup(), > > > we end up in a deadlock, since we try to obtain a read lock on the same > > > extent buffer that we previously write locked. > > > > > > So fix this by skipping the loading of free space caches of any block > > > groups that are not yet cached (rare cases) if we are COWing an extent > > > buffer from the root tree and space caching is enabled (-o space_cache > > > mount option). This is a rare case and its downside is failure to > > > find a free extent (return -ENOSPC) when all the already cached block > > > groups have no free extents. > > > > > > Reported-by: Andrew Nelson > > > Link: > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > > > Tested-by: Andrew Nelson > > > Signed-off-by: Filipe Manana > > > > Great, thanks, > > > > Reviewed-by: Josef Bacik > > So this makes many fstests occasionally fail with aborted transaction > due to ENOSPC. > It's late and I haven't verified yet, but I suppose this is because we > always skip loading the cache regardless of currently being COWing an > existing leaf or allocating a new one (growing the tree). > Needs to be fixed. > How about we just use path->search_commit_root? If we're loading the cache we just want the last committed version, it's not like we read it after we've written it. Then we can go back to business as usual. Thanks, Josef
Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent
On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > When we are writing out a free space cache, during the transaction commit > phase, we can end up in a deadlock which results in a stack trace like the > following: > > schedule+0x28/0x80 > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > ? finish_wait+0x80/0x80 > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > btrfs_search_slot+0xf6/0x9f0 [btrfs] > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > ? inode_insert5+0x119/0x190 > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_iget+0x113/0x690 [btrfs] > __lookup_free_space_inode+0xd8/0x150 [btrfs] > lookup_free_space_inode+0x5b/0xb0 [btrfs] > load_free_space_cache+0x7c/0x170 [btrfs] > ? cache_block_group+0x72/0x3b0 [btrfs] > cache_block_group+0x1b3/0x3b0 [btrfs] > ? finish_wait+0x80/0x80 > find_free_extent+0x799/0x1010 [btrfs] > btrfs_reserve_extent+0x9b/0x180 [btrfs] > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > __btrfs_cow_block+0x11d/0x500 [btrfs] > btrfs_cow_block+0xdc/0x180 [btrfs] > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_update_inode_item+0x46/0x100 [btrfs] > cache_save_setup+0xe4/0x3a0 [btrfs] > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > At cache_save_setup() we need to update the inode item of a block group's > cache which is located in the tree root (fs_info->tree_root), which means > that it may result in COWing a leaf from that tree. If that happens we > need to find a free metadata extent and while looking for one, if we find > a block group which was not cached yet we attempt to load its cache by > calling cache_block_group(). 
However this function will try to load the > inode of the free space cache, which requires finding the matching inode > item in the tree root - if that inode item is located in the same leaf as > the inode item of the space cache we are updating at cache_save_setup(), > we end up in a deadlock, since we try to obtain a read lock on the same > extent buffer that we previously write locked. > > So fix this by skipping the loading of free space caches of any block > groups that are not yet cached (rare cases) if we are COWing an extent > buffer from the root tree and space caching is enabled (-o space_cache > mount option). This is a rare case and its downside is failure to > find a free extent (return -ENOSPC) when all the already cached block > groups have no free extents. > > Reported-by: Andrew Nelson > Link: > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > Tested-by: Andrew Nelson > Signed-off-by: Filipe Manana Great, thanks, Reviewed-by: Josef Bacik Josef
Re: [PATCH v2] Btrfs: fix deadlock on tree root leaf when finding free extent
On Mon, Oct 22, 2018 at 07:48:30PM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > When we are writing out a free space cache, during the transaction commit > phase, we can end up in a deadlock which results in a stack trace like the > following: > > schedule+0x28/0x80 > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > ? finish_wait+0x80/0x80 > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > btrfs_search_slot+0xf6/0x9f0 [btrfs] > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > ? inode_insert5+0x119/0x190 > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_iget+0x113/0x690 [btrfs] > __lookup_free_space_inode+0xd8/0x150 [btrfs] > lookup_free_space_inode+0x5b/0xb0 [btrfs] > load_free_space_cache+0x7c/0x170 [btrfs] > ? cache_block_group+0x72/0x3b0 [btrfs] > cache_block_group+0x1b3/0x3b0 [btrfs] > ? finish_wait+0x80/0x80 > find_free_extent+0x799/0x1010 [btrfs] > btrfs_reserve_extent+0x9b/0x180 [btrfs] > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > __btrfs_cow_block+0x11d/0x500 [btrfs] > btrfs_cow_block+0xdc/0x180 [btrfs] > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_update_inode_item+0x46/0x100 [btrfs] > cache_save_setup+0xe4/0x3a0 [btrfs] > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > At cache_save_setup() we need to update the inode item of a block group's > cache which is located in the tree root (fs_info->tree_root), which means > that it may result in COWing a leaf from that tree. If that happens we > need to find a free metadata extent and while looking for one, if we find > a block group which was not cached yet we attempt to load its cache by > calling cache_block_group(). 
However this function will try to load the > inode of the free space cache, which requires finding the matching inode > item in the tree root - if that inode item is located in the same leaf as > the inode item of the space cache we are updating at cache_save_setup(), > we end up in a deadlock, since we try to obtain a read lock on the same > extent buffer that we previously write locked. > > So fix this by skipping the loading of free space caches of any block > groups that are not yet cached (rare cases) if we are COWing an extent > buffer from the root tree and space caching is enabled (-o space_cache > mount option). This is a rare case and its downside is failure to > find a free extent (return -ENOSPC) when all the already cached block > groups have no free extents. > > Reported-by: Andrew Nelson > Link: > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > Tested-by: Andrew Nelson > Signed-off-by: Filipe Manana > --- > > V2: Made the solution more generic, since the problem could happen in any > path COWing an extent buffer from the root tree. 
> Applies on top of a previous patch titled:
>
> "Btrfs: fix deadlock when writing out free space caches"
>
>  fs/btrfs/ctree.c       |  4 ++++
>  fs/btrfs/ctree.h       |  3 +++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/extent-tree.c | 15 ++++++++++++++-
>  4 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 089b46c4d97f..646aafda55a3 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1065,10 +1065,14 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>  	    root == fs_info->chunk_root ||
>  	    root == fs_info->dev_root)
>  		trans->can_flush_pending_bgs = false;
> +	else if (root == fs_info->tree_root)
> +		atomic_inc(&fs_info->tree_root_cows);
>
>  	cow = btrfs_alloc_tree_block(trans, root, parent_start,
>  				     root->root_key.objectid, &disk_key, level,
>  				     search_start, empty_size);
> +	if (root == fs_info->tree_root)
> +		atomic_dec(&fs_info->tree_root_cows);

Do we need this though? Our root should be the root we're cow'ing the block for, and it should be passed all the way down to find_free_extent properly, so we really should be able to just do if (root == fs_info->tree_root) and not add all this stuff. Not to mention this will race with anybody else doing stuff: another thread that isn't actually touching the tree_root would skip caching a block group when it's completely OK for that thread to do so. Thanks, Josef
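The race being pointed out can be sketched with a toy model (not btrfs code; the names are invented for illustration): a global "the tree root is being COWed" counter is visible to every allocating thread, so an unrelated thread skips caching too, while checking the root argument that was passed down to the allocation only affects the one thread actually doing the COW:

```python
# Toy model (not btrfs code) of the false-positive race: global state vs a
# per-call check on the root that was passed down to the allocator.

tree_root_cows = 0  # the global counter proposed in the v2 patch

def may_cache_block_group_global():
    # Every thread sees the global counter, related to tree_root or not.
    return tree_root_cows == 0

def may_cache_block_group_per_call(root):
    # Only the caller that is actually COWing the tree root is affected.
    return root != "tree_root"

# Thread A starts COWing the tree root:
tree_root_cows += 1

# Thread B is allocating for an ordinary subvolume at the same time.
global_says = may_cache_block_group_global()               # false positive
per_call_says = may_cache_block_group_per_call("fs_root")  # unaffected
```

Under the global scheme thread B is wrongly told not to cache; under the per-call scheme only allocations on behalf of the tree root skip caching, which is the behavior the reply argues for.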
Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent
On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdman...@kernel.org wrote: > From: Filipe Manana > > When we are writing out a free space cache, during the transaction commit > phase, we can end up in a deadlock which results in a stack trace like the > following: > > schedule+0x28/0x80 > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > ? finish_wait+0x80/0x80 > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > btrfs_search_slot+0xf6/0x9f0 [btrfs] > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > ? inode_insert5+0x119/0x190 > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_iget+0x113/0x690 [btrfs] > __lookup_free_space_inode+0xd8/0x150 [btrfs] > lookup_free_space_inode+0x5b/0xb0 [btrfs] > load_free_space_cache+0x7c/0x170 [btrfs] > ? cache_block_group+0x72/0x3b0 [btrfs] > cache_block_group+0x1b3/0x3b0 [btrfs] > ? finish_wait+0x80/0x80 > find_free_extent+0x799/0x1010 [btrfs] > btrfs_reserve_extent+0x9b/0x180 [btrfs] > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > __btrfs_cow_block+0x11d/0x500 [btrfs] > btrfs_cow_block+0xdc/0x180 [btrfs] > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > ? kmem_cache_alloc+0x166/0x1d0 > btrfs_update_inode_item+0x46/0x100 [btrfs] > cache_save_setup+0xe4/0x3a0 [btrfs] > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > At cache_save_setup() we need to update the inode item of a block group's > cache which is located in the tree root (fs_info->tree_root), which means > that it may result in COWing a leaf from that tree. If that happens we > need to find a free metadata extent and while looking for one, if we find > a block group which was not cached yet we attempt to load its cache by > calling cache_block_group(). 
However this function will try to load the > inode of the free space cache, which requires finding the matching inode > item in the tree root - if that inode item is located in the same leaf as > the inode item of the space cache we are updating at cache_save_setup(), > we end up in a deadlock, since we try to obtain a read lock on the same > extent buffer that we previously write locked. > > So fix this by skipping the loading of free space caches of any block > groups that are not yet cached (rare cases) if we are updating the inode > of a free space cache. This is a rare case and its downside is failure to > find a free extent (return -ENOSPC) when all the already cached block > groups have no free extents. > Actually isn't this a problem for anything that tries to allocate an extent while in the tree_root? Like we snapshot or make a subvolume or anything? We should just disallow if root == tree_root. But even then we only need to do this if we're using SPACE_CACHE, using the ye-olde caching or the free space tree are both ok. Let's just limit it to those cases. Thanks, Josef
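The narrowing suggested in the reply (only the v1 space cache stores its inode items in the tree root, so only that combination needs the workaround) amounts to a two-part condition. A minimal sketch, with invented names and not actual btrfs code:

```python
# Toy sketch (not btrfs code) of the suggested condition: skip loading an
# uncached block group's cache only when allocating on behalf of the tree
# root AND the v1 space cache is in use; the ye-olde (scan-based) caching
# and the free space tree don't touch the tree root, so they stay enabled.

def must_skip_cache_load(root, cache_mode):
    return root == "tree_root" and cache_mode == "SPACE_CACHE"
```

Both other cache modes keep their normal behavior, so the ENOSPC downside is confined to the rare tree-root-allocation-with-space_cache case.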
Re: [PATCH] Btrfs: fix use-after-free when dumping free space
] kthread+0x1d2/0x1f0 > [ 9520.409138] ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs] > [ 9520.409440] ? kthread_park+0xb0/0xb0 > [ 9520.409682] ret_from_fork+0x3a/0x50 > [ 9520.410508] Dumping ftrace buffer: > [ 9520.410764](ftrace buffer empty) > [ 9520.411007] CR2: 0011 > [ 9520.411297] ---[ end trace 01a0863445cf360a ]--- > [ 9520.411568] RIP: 0010:rb_next+0x3c/0x90 > [ 9520.412644] RSP: 0018:8801074ff780 EFLAGS: 00010292 > [ 9520.412932] RAX: RBX: 0001 RCX: > 81b5ac4c > [ 9520.413274] RDX: RSI: 0008 RDI: > 0011 > [ 9520.413616] RBP: 8801074ff7a0 R08: ed0021d64ccc R09: > ed0021d64ccc > [ 9520.414007] R10: 0001 R11: ed0021d64ccb R12: > 8800b91e > [ 9520.414349] R13: 8800a3ceba48 R14: 8800b627bf80 R15: > 0002 > [ 9520.416074] FS: () GS:88010eb0() > knlGS: > [ 9520.416536] CS: 0010 DS: ES: CR0: 80050033 > [ 9520.416848] CR2: 0011 CR3: 000106b52000 CR4: > 06a0 > [ 9520.418477] DR0: DR1: DR2: > > [ 9520.418846] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 9520.419204] Kernel panic - not syncing: Fatal exception > [ 9520.419666] Dumping ftrace buffer: > [ 9520.419930](ftrace buffer empty) > [ 9520.420168] Kernel Offset: disabled > [ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]--- > > Fix this by acquiring the respective lock before iterating the rbtree. > > Reported-by: Nikolay Borisov > Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
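The fix described above (take the structure's lock before iterating the rbtree, so a concurrent unlink can't free a node out from under the walker) follows a general discipline that can be modeled outside the kernel. A toy sketch, not btrfs code, with invented names:

```python
# Toy model (not btrfs code) of the lock discipline: both the dumper and
# any mutator take the same lock, so nodes cannot be freed mid-iteration.
import threading

lock = threading.Lock()                    # stand-in for ctl->tree_lock
entries = {0: 100, 4096: 50, 8192: 25}     # offset -> free bytes (the "tree")

def dump_free_space():
    # The fix: hold the lock for the entire walk, not per-node.
    with lock:
        return sorted(entries.items())

def concurrent_unlink(offset):
    # Mutators serialize on the same lock, so they can't interleave
    # with an in-progress dump.
    with lock:
        entries.pop(offset, None)

snapshot = dump_free_space()
concurrent_unlink(4096)
```

In the real bug the iteration ran without the lock, so `rb_next()` could follow a pointer into a just-freed node, which is exactly the NULL-ish dereference in the panic above.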
[PATCH 33/42] btrfs: fix insert_reserved error handling
We were not handling the reserved byte accounting properly for data references. Metadata was fine: if it errored out, the error paths would free the bytes_reserved count and pin the extent, but they missed one of the error cases. So instead move this handling up into run_one_delayed_ref() so we are sure that both cases are properly cleaned up in case of a transaction abort.

Reviewed-by: David Sterba
Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 933cba06c9fb..9b5366932298 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 					   insert_reserved);
 	else
 		BUG();
+	if (ret && insert_reserved)
+		btrfs_pin_extent(trans->fs_info, node->bytenr,
+				 node->num_bytes, 1);
 	return ret;
 }

@@ -8257,21 +8260,14 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	}

 	path = btrfs_alloc_path();
-	if (!path) {
-		btrfs_free_and_pin_reserved_extent(fs_info,
-						   extent_key.objectid,
-						   fs_info->nodesize);
+	if (!path)
 		return -ENOMEM;
-	}

 	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
 				      &extent_key, size);
 	if (ret) {
 		btrfs_free_path(path);
-		btrfs_free_and_pin_reserved_extent(fs_info,
-						   extent_key.objectid,
-						   fs_info->nodesize);
 		return ret;
 	}
--
2.14.3
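The accounting move in this patch (hoist the error-path cleanup into one place in the caller instead of duplicating it, incompletely, in each helper) can be sketched as a toy model. This is not btrfs code; the numbers and names are illustrative only:

```python
# Toy accounting model (not btrfs code): on failure with insert_reserved
# set, the caller moves the bytes from "reserved" to "pinned" in a single
# place, so no helper's error path can forget to do it.

space = {"reserved": 4096, "pinned": 0}

def alloc_reserved_extent(fail):
    # Helper no longer touches the reservation on its error paths.
    if fail:
        return -1
    space["reserved"] -= 4096
    return 0

def run_one_delayed_ref(fail, insert_reserved=True):
    ret = alloc_reserved_extent(fail)
    if ret and insert_reserved:
        # Single cleanup point, mirroring the btrfs_pin_extent() call:
        # the reserved bytes become pinned and get reclaimed on unpin.
        space["pinned"] += 4096
        space["reserved"] -= 4096
    return ret

ret = run_one_delayed_ref(fail=True)
```

Because the cleanup lives in the caller, adding a new failure mode to a helper can no longer leak the reservation.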
[PATCH 31/42] btrfs: cleanup pending bgs on transaction abort
We may abort the transaction during a commit and not have a chance to run the pending bgs stuff, which will leave block groups on our list and cause us accounting issues and leaked memory. Fix this by running the pending bgs when we clean up a transaction.

Reviewed-by: Omar Sandoval
Signed-off-by: Josef Bacik
---
 fs/btrfs/transaction.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 46ca775a709e..9168efaca37e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2280,6 +2280,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_scrub_continue(fs_info);
 cleanup_transaction:
 	btrfs_trans_release_metadata(trans);
+	/* This cleans up the pending block groups list properly. */
+	if (!trans->aborted)
+		trans->aborted = ret;
+	btrfs_create_pending_block_groups(trans);
 	btrfs_trans_release_chunk_metadata(trans);
 	trans->block_rsv = NULL;
 	btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
--
2.14.3
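The shape of this fix (on abort, still drain the pending list, but with the aborted flag set so "creation" degrades into teardown) can be sketched as a toy model. Not btrfs code; the class and values are invented for illustration:

```python
# Toy model (not btrfs code): draining the pending block-group list in the
# abort cleanup path, with trans.aborted set first so nothing is actually
# created -- the queued entries are just torn down instead of leaked.

class Trans:
    def __init__(self):
        self.aborted = 0
        self.pending_bgs = ["bg1", "bg2"]  # queued, never inserted

def create_pending_block_groups(trans):
    # With trans.aborted set this only empties the list (teardown);
    # otherwise it would insert each queued item before emptying it.
    created = [] if trans.aborted else list(trans.pending_bgs)
    trans.pending_bgs.clear()
    return created

trans = Trans()
ret = -5  # the commit failed with some error (e.g. -EIO)
if not trans.aborted:
    trans.aborted = ret
created = create_pending_block_groups(trans)
```

Setting `aborted` before the drain matters: it is what tells the drain routine to discard rather than insert, matching the order of the two added lines in the hunk above.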