[PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Josef Bacik
From: Josef Bacik 

With my delayed refs patches in place we started seeing a large number of
aborts in __btrfs_free_extent:

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified
while it's dropping the snapshot.  It will unlock the root node (and really
every node) as it walks down the tree, only to re-lock it when it needs to do
something.  This is a problem because if we modify the tree we could cow a block
in our path, which frees our reference to that block.  Then once we get back to
that shared block we'll free our reference to it again, and get ENOENT when
trying to look up our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed
for our deleted snapshot _after_ all of the inodes are closed for the snapshot.
We only run the delayed inode item if we're deleting the inode, and even then we
do not run the delayed insertions or delayed removals.  These can be run at any
point after our final inode does its last iput, which is what triggers the
snapshot deletion.  We can end up with the snapshot deletion happening and then
have the delayed items run on that file system, resulting in the above problem.
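To make the ordering concrete, the failing sequence is roughly the following
(an illustrative sketch, not an actual trace):

  1. The cleaner thread starts btrfs_drop_snapshot() for the deleted snapshot,
     walks down the tree, drops its reference on a shared block and unlocks
     the path.
  2. Delayed insertions/removals left over for that snapshot are run, which
     cows a block in the drop path and frees our reference to it.
  3. btrfs_drop_snapshot() re-locks, reaches the shared block and frees the
     reference a second time; __btrfs_free_extent() can't find the extent ref,
     returns -ENOENT, and the transaction aborts.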

This problem has existed forever, however my patches made it much easier to hit
as I wake up the cleaner much more often to deal with delayed iputs, which made
us more likely to start the snapshot dropping work before the transaction
commits, which is when the delayed items would generally be run.  Before,
generally speaking, we would run the delayed items, commit the transaction, and
wake up the cleaner thread to start deleting snapshots, which means we were less
likely to hit this problem.  You could still hit it if you had multiple
snapshots to be deleted and ended up with lots of delayed items, but it was
definitely harder.

Fix this for now by simply running all the delayed items before starting to drop
the snapshot.  We could make this smarter in the future by making the delayed
items per-root, and then simply dropping any delayed items for roots that we are
going to delete.  But for now the quick and easy solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
v1->v2:
- check for errors from btrfs_run_delayed_items.
- Dave, I can reroll the series, but the second version of patch 1 is the same;
  let me know what you want.

 fs/btrfs/extent-tree.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..473084eb7a2d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
goto out_free;
}
 
+   err = btrfs_run_delayed_items(trans);
+   if (err)
+   goto out_end_trans;
+
if (block_rsv)
trans->block_rsv = block_rsv;
 
-- 
2.14.3



Re: [RFC PATCH] btrfs: Remove __extent_readpages

2018-12-05 Thread Josef Bacik
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
> example) then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) so it is ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages. 
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller than
> the memory, so pages should have been nicely readahead.
> 

What this patch changes is that we no longer read all of the pages we are passed
by the readahead code, so we fall back to per-page reading when we later go to
read those pages, because the readahead window has already moved past them.  This
isn't great behavior; what we have now is nice in that it tries to group the
entire range together as much as possible.  What your patch changes is that as
soon as we stop having a contiguous range we just bail and submit what we have.
Testing it in isolation is likely to show very little change, but having recently
touched the fault-in code, where we definitely do not count major faults in some
cases, I'd suspect that we're not reflecting this higher fault rate in the
performance counters properly.  We should preserve the existing behavior: what
hurts a little bit on a lightly loaded box is going to hurt a whole lot more on
a heavily loaded box.  Thanks,

Josef
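For readers without the code in front of them, the grouping behavior being
defended here is roughly the following (a simplified sketch; submit_run() stands
in for __do_contiguous_readpages() and is not a real kernel function):

static void submit_run(unsigned long first_index, unsigned long last_index);

/*
 * Split a batch of page indices (as built by extent_readpages()) into
 * contiguous runs and submit each run as a single read.  Gaps appear when
 * add_to_page_cache_lru() failed for a page in the batch.
 */
static void split_into_contig_runs(const unsigned long *index, int nr)
{
	int start = 0;
	int i;

	for (i = 1; i <= nr; i++) {
		if (i == nr || index[i] != index[i - 1] + 1) {
			/* index[start]..index[i - 1] are contiguous */
			submit_run(index[start], index[i - 1]);
			start = i;
		}
	}
}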


Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-12-04 Thread Josef Bacik
On Tue, Dec 04, 2018 at 01:46:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h   |  4 +++-
> >  fs/btrfs/disk-io.c |  5 ++---
> >  fs/btrfs/extent-tree.c | 13 -
> >  fs/btrfs/inode.c   | 21 +
> >  4 files changed, 34 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index dc56a4d940c3..20af5d6d81f1 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -915,7 +915,8 @@ struct btrfs_fs_info {
> >  
> > spinlock_t delayed_iput_lock;
> > struct list_head delayed_iputs;
> > -   struct mutex cleaner_delayed_iput_mutex;
> > +   atomic_t nr_delayed_iputs;
> > +   wait_queue_head_t delayed_iputs_wait;
> >  
> 
> Have you considered whether the same could be achieved with a completion
> rather than an open-coded waitqueue? I tried prototyping it and it could
> be done but it becomes messy regarding when the completion should be
> initialised, i.e. only when it's not in btrfs_add_delayed_iput.
> 

Yeah a waitqueue makes more sense here than a completion since it's not a
one-off.

> 
> 
> 
> > @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
> >  * bunch of pinned space, so make sure we run the iputs before
> >  * we do our pinned bytes check below.
> >  */
> > -   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> > btrfs_run_delayed_iputs(fs_info);
> > -   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> > +   btrfs_wait_on_delayed_iputs(fs_info);
> 
> Waiting on delayed iputs here is pointless since they are run
> synchronously from this context.
> 

Unless there are other threads (the cleaner thread) running iputs as well.  We
could be running an iput that is evicting the inode in another thread and we
really want that space, so we need to wait here to make sure everybody is truly
done.  Thanks,

Josef
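For reference, the counter-plus-waitqueue pattern being discussed looks roughly
like this (a sketch based on the description above; the actual helpers added in
the patch may differ in detail, and the killable wait is an assumption):

/* When queueing a delayed iput (producer side): */
	atomic_inc(&fs_info->nr_delayed_iputs);
	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);

/* After running one delayed iput (consumer side, cleaner or throttler): */
	if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
		wake_up(&fs_info->delayed_iputs_wait);

/* Waiter, used from the data reservation / flushing paths: */
int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
{
	int ret = wait_event_killable(fs_info->delayed_iputs_wait,
			atomic_read(&fs_info->nr_delayed_iputs) == 0);
	return ret ? -EINTR : 0;
}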


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-12-04 Thread Josef Bacik
On Tue, Dec 04, 2018 at 11:21:14AM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The cleaner thread usually takes care of delayed iputs, with the
> > exception of the btrfs_end_transaction_throttle path.  The cleaner
> > thread only gets woken up every 30 seconds, so instead wake it up to do
> > its work so that we can free up that space as quickly as possible.
> 
> This description misses any rationale whatsoever about why the cleaner
> needs to be woken up more frequently than 30 seconds (and IMO this is
> the most important question that needs answering).
> 

Yeah I'll add that.

> Also have you done any measurements of the number of processed delayed
> inodes with this change? Given the behavior you so desire, why not just
> make delayed iputs run via schedule_work on the global workqueue
> and be done with it? I'm sure the latency will be better than the
> current 30-second one :)

We already have the cleaner thread to do this work, and it sets up for the
snapshot drop stuff to be run as well.  We could probably add another delayed
work thing, but I would rather do that in a different patch.  Thanks,

Josef


[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-12-03 Thread Josef Bacik
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path.  The cleaner
thread only gets woken up every 30 seconds, so instead wake it up to do
its work so that we can free up that space as quickly as possible.

Reviewed-by: Filipe Manana 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   | 3 +++
 fs/btrfs/disk-io.c | 3 +++
 fs/btrfs/inode.c   | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c8ddbacb6748..dc56a4d940c3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -769,6 +769,9 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info 
*fs_info, void *ptr);
  */
 #define BTRFS_FS_BALANCE_RUNNING   18
 
+/* Indicate that the cleaner thread is awake and doing something. */
+#define BTRFS_FS_CLEANER_RUNNING   19
+
 struct btrfs_fs_info {
u8 fsid[BTRFS_FSID_SIZE];
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5918ff8241b..f40f6fdc1019 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1669,6 +1669,8 @@ static int cleaner_kthread(void *arg)
while (1) {
again = 0;
 
+   set_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
+
/* Make the cleaner go to sleep early. */
if (btrfs_need_cleaner_sleep(fs_info))
goto sleep;
@@ -1715,6 +1717,7 @@ static int cleaner_kthread(void *arg)
 */
btrfs_delete_unused_bgs(fs_info);
 sleep:
+   clear_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags);
if (kthread_should_park())
kthread_parkme();
if (kthread_should_stop())
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ac7abe2ae9b..0b9f3e482cea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3264,6 +3264,8 @@ void btrfs_add_delayed_iput(struct inode *inode)
ASSERT(list_empty(&binode->delayed_iput));
list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
spin_unlock(&fs_info->delayed_iput_lock);
+   if (!test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
+   wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3



[PATCH 1/3] btrfs: run delayed iputs before committing

2018-12-03 Thread Josef Bacik
Delayed iputs means we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be freed.  So before we decide to commit the transaction for ENOSPC
reasons, run the delayed iputs so that any potential space is freed up.
If there is, and we freed enough, we can then commit the transaction and
potentially be able to make our reservation.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8dfddfd3f315..0127d272cd2a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4953,6 +4953,15 @@ static void flush_space(struct btrfs_fs_info *fs_info,
ret = 0;
break;
case COMMIT_TRANS:
+   /*
+* If we have pending delayed iputs then we could free up a
+* bunch of pinned space, so make sure we run the iputs before
+* we do our pinned bytes check below.
+*/
+   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+   btrfs_run_delayed_iputs(fs_info);
+   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
ret = may_commit_transaction(fs_info, space_info);
break;
default:
-- 
2.14.3



[PATCH 0/3][V2] Delayed iput fixes

2018-12-03 Thread Josef Bacik
v1->v2:
- only wake up if the cleaner isn't currently doing work.
- re-arranged some stuff for running delayed iputs during flushing.
- removed the open-coded wakeup in the waitqueue patch.

-- Original message --

Here are some delayed iput fixes.  Delayed iputs can hold reservations for a
while and there's no real good way to make sure they're gone for good, which
means we could hit an early ENOSPC when in reality, if we had just waited for
the iput, we would have had plenty of space.  So fix this up by making us wait
for delayed iputs when deciding if we need to commit for ENOSPC flushing, and
then clean up and rework how we run delayed iputs to make it more
straightforward to wait on them and make sure we're all done using them.  Thanks,

Josef


[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-12-03 Thread Josef Bacik
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  4 +++-
 fs/btrfs/disk-io.c |  5 ++---
 fs/btrfs/extent-tree.c | 13 -
 fs/btrfs/inode.c   | 21 +
 4 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dc56a4d940c3..20af5d6d81f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -915,7 +915,8 @@ struct btrfs_fs_info {
 
spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
-   struct mutex cleaner_delayed_iput_mutex;
+   atomic_t nr_delayed_iputs;
+   wait_queue_head_t delayed_iputs_wait;
 
/* this protects tree_mod_seq_list */
spinlock_t tree_mod_seq_lock;
@@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
  u64 start, u64 num_bytes, u64 min_size,
  loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f40f6fdc1019..238e0113f2d3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg)
goto sleep;
}
 
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
again = btrfs_clean_one_deleted_snapshot(root);
mutex_unlock(&fs_info->cleaner_mutex);
@@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->delete_unused_bgs_mutex);
mutex_init(&fs_info->reloc_mutex);
mutex_init(&fs_info->delalloc_root_mutex);
-   mutex_init(&fs_info->cleaner_delayed_iput_mutex);
seqlock_init(&fs_info->profiles_lock);

INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb,
atomic_set(&fs_info->defrag_running, 0);
atomic_set(&fs_info->qgroup_op_seq, 0);
atomic_set(&fs_info->reada_works_cnt, 0);
+   atomic_set(&fs_info->nr_delayed_iputs, 0);
atomic64_set(&fs_info->tree_mod_seq, 0);
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb,
init_waitqueue_head(&fs_info->transaction_wait);
init_waitqueue_head(&fs_info->transaction_blocked_wait);
init_waitqueue_head(&fs_info->async_submit_wait);
+   init_waitqueue_head(&fs_info->delayed_iputs_wait);

INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0127d272cd2a..5b6c9fc227ff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
/*
 * The cleaner kthread might still be doing iput
 * operations. Wait for it to finish so that
-* more space is released.
+* more space is released.  We don't need to
+* explicitly run the delayed iputs here because
+* the commit_transaction would have woken up
+* the cleaner.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   ret = btrfs_wait_on_delayed_iputs(fs_info);
+   if (ret)
+   return ret;
goto again;
} else {
btrfs_end_transaction(trans);
@@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 * bunch of pinned space, so make sure we run the iputs before
 * we do our pinned by

[PATCH 8/8] btrfs: reserve extra space during evict()

2018-12-03 Thread Josef Bacik
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact.  So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 623a71d871d4..8ac7abe2ae9b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+   u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
int failures = 0;
 
for (;;) {
struct btrfs_trans_handle *trans;
int ret;
 
-   ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
+   ret = btrfs_block_rsv_refill(root, rsv,
+rsv->size + delayed_refs_extra,
 BTRFS_RESERVE_FLUSH_LIMIT);
 
if (ret && ++failures > 2) {
@@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
return ERR_PTR(-ENOSPC);
}
 
+   /*
+* Evict can generate a large amount of delayed refs without
+* having a way to add space back since we exhaust our temporary
+* block rsv.  We aren't allowed to do FLUSH_ALL in this case
+* because we could deadlock with so many things in the flushing
+* code, so we have to try and hold some extra space to
+* compensate for our delayed ref generation.  If we can't get
+* that space then we need to see if we can steal our minimum from
+* the global reserve.  We will be ratelimited by the amount of
+* space we have for the delayed refs rsv, so we'll end up
+* committing and trying again.
+*/
trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans) || !ret)
+   if (IS_ERR(trans) || !ret) {
+   if (!IS_ERR(trans)) {
+   trans->block_rsv = &fs_info->trans_block_rsv;
+   trans->bytes_reserved = delayed_refs_extra;
+   btrfs_block_rsv_migrate(rsv, trans->block_rsv,
+   delayed_refs_extra, 1);
+   }
return trans;
+   }
 
/*
 * Try to steal from the global reserve if there is space for
-- 
2.14.3



[PATCH 1/8] btrfs: check if free bgs for commit

2018-12-03 Thread Josef Bacik
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
Reviewed-by: Nikolay Borisov 
---
 fs/btrfs/extent-tree.c | 34 --
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 07ef1b8087f7..755eb226d32d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4853,10 +4853,19 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes_needed)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes_needed,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+   trans = btrfs_join_transaction(fs_info->extent_root);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+   /*
+* See if there is enough pinned space to make this reservation, or if
+* we have bg's that are going to be freed, allowing us to possibly do a
+* chunk allocation the next loop through.
+*/
+   if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+   __percpu_counter_compare(&space_info->total_bytes_pinned,
+bytes_needed,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
goto commit;
 
/*
@@ -4864,7 +4873,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * this reservation.
 */
if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
+   goto enospc;
 
spin_lock(&delayed_rsv->lock);
reclaim_bytes += delayed_rsv->reserved;
@@ -4878,17 +4887,14 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
bytes_needed -= reclaim_bytes;
 
if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes_needed,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-   return -ENOSPC;
-   }
-
+bytes_needed,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+   goto enospc;
 commit:
-   trans = btrfs_join_transaction(fs_info->extent_root);
-   if (IS_ERR(trans))
-   return -ENOSPC;
-
return btrfs_commit_transaction(trans);
+enospc:
+   btrfs_end_transaction(trans);
+   return -ENOSPC;
 }
 
 /*
-- 
2.14.3



[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-12-03 Thread Josef Bacik
With my change to no longer take into account the global reserve for
metadata allocation chunks we have this side-effect for mixed block
group fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have then we can try and force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/extent-tree.c   | 18 +-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 30da075c042e..7cf6ad021d81 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
FLUSH_DELALLOC  =   5,
FLUSH_DELALLOC_WAIT =   6,
ALLOC_CHUNK =   7,
-   COMMIT_TRANS=   8,
+   ALLOC_CHUNK_FORCE   =   8,
+   COMMIT_TRANS=   9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 667b992d322d..2d0dd70570ca 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4938,6 +4938,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
btrfs_end_transaction(trans);
break;
case ALLOC_CHUNK:
+   case ALLOC_CHUNK_FORCE:
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -4945,7 +4946,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
}
ret = do_chunk_alloc(trans,
 btrfs_metadata_alloc_profile(fs_info),
-CHUNK_ALLOC_NO_FORCE);
+(state == ALLOC_CHUNK) ?
+CHUNK_ALLOC_NO_FORCE :
+CHUNK_ALLOC_FORCE);
btrfs_end_transaction(trans);
if (ret > 0 || ret == -ENOSPC)
ret = 0;
@@ -5081,6 +5084,19 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
commit_cycles--;
}
 
+   /*
+* We don't want to force a chunk allocation until we've tried
+* pretty hard to reclaim space.  Think of the case where we
+* free'd up a bunch of space and so have a lot of pinned space
+* to reclaim.  We would rather use that than possibly create an
+* underutilized metadata chunk.  So if this is our first run
+* through the flushing state machine skip ALLOC_CHUNK_FORCE and
+* commit the transaction.  If nothing has changed the next go
+* around then we can force a chunk allocation.
+*/
+   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+   flush_state++;
+
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
{ FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},   
\
{ FLUSH_DELAYED_REFS,   "FLUSH_ELAYED_REFS"},   
\
{ ALLOC_CHUNK,  "ALLOC_CHUNK"}, 
\
+   { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"},   
\
{ COMMIT_TRANS, "COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3



[PATCH 0/8][V2] Enospc cleanups and fixes

2018-12-03 Thread Josef Bacik
v1->v2:
- addressed comments from reviewers.
- fixed a bug in patch 6 that was introduced because of changes to upstream.

-- Original message --

The delayed refs rsv patches exposed a bunch of issues in our enospc
infrastructure that needed to be addressed.  These aren't really one coherent
group, but they are all around flushing and reservations.
may_commit_transaction() needed to be updated a little bit, and we needed to add
a new state to force chunk allocation if things got dicey.  Also because we can
end up needing to reserve a whole bunch of extra space for outstanding delayed
refs we needed the ability to fail only the tickets that were too big to satisfy
with ENOSPC, instead of failing all of the tickets.  There's also a fix in here for
one of the corner cases where we didn't quite have enough space reserved for the
delayed refs we were generating during evict().  Thanks,

Josef


[PATCH 2/8] btrfs: dump block_rsv when dumping space info

2018-12-03 Thread Josef Bacik
For enospc_debug having the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
---
 fs/btrfs/extent-tree.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 755eb226d32d..204b35434056 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8063,6 +8063,15 @@ static noinline int find_free_extent(struct 
btrfs_fs_info *fs_info,
return ret;
 }
 
+#define DUMP_BLOCK_RSV(fs_info, rsv_name)  \
+do {   \
+   struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;   \
+   spin_lock(&__rsv->lock);\
+   btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",  \
+  __rsv->size, __rsv->reserved);   \
+   spin_unlock(&__rsv->lock);  \
+} while (0)
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *info, u64 bytes,
int dump_block_groups)
@@ -8082,6 +8091,12 @@ static void dump_space_info(struct btrfs_fs_info 
*fs_info,
info->bytes_readonly);
spin_unlock(&info->lock);
 
+   DUMP_BLOCK_RSV(fs_info, global_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
+
if (!dump_block_groups)
return;
 
-- 
2.14.3



[PATCH 6/8] btrfs: loop in inode_rsv_refill

2018-12-03 Thread Josef Bacik
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
couldn't make it re-calculate our new reservation size and try again.
If our reservation size doesn't change between tries then we know we are
actually out of space and can error out.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 58 +-
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0ee77a98f867..0e1a499035ac 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
return ret;
 }
 
+static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
+ u64 *metadata_bytes, u64 *qgroup_bytes)
+{
+   *metadata_bytes = 0;
+   *qgroup_bytes = 0;
+
+   spin_lock(&block_rsv->lock);
+   if (block_rsv->reserved < block_rsv->size)
+   *metadata_bytes = block_rsv->size - block_rsv->reserved;
+   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+   *qgroup_bytes = block_rsv->qgroup_rsv_size -
+   block_rsv->qgroup_rsv_reserved;
+   spin_unlock(&block_rsv->lock);
+}
+
 /**
  * btrfs_inode_rsv_refill - refill the inode block rsv.
  * @inode - the inode we are refilling.
@@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
 {
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-   u64 num_bytes = 0;
+   u64 num_bytes = 0, last = 0;
u64 qgroup_num_bytes = 0;
int ret = -ENOSPC;
 
-   spin_lock(&block_rsv->lock);
-   if (block_rsv->reserved < block_rsv->size)
-   num_bytes = block_rsv->size - block_rsv->reserved;
-   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-   qgroup_num_bytes = block_rsv->qgroup_rsv_size -
-  block_rsv->qgroup_rsv_reserved;
-   spin_unlock(&block_rsv->lock);
-
+   __get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
if (num_bytes == 0)
return 0;
 
-   ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
-   if (ret)
-   return ret;
-   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   do {
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, 
qgroup_num_bytes, true);
+   if (ret)
+   return ret;
+   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   if (ret) {
+   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   last = num_bytes;
+   /*
+* If we are fragmented we can end up with a lot of
+* outstanding extents which will make our size be much
+* larger than our reserved amount.  If we happen to
+* try to do a reservation here that may result in us
+* trying to do a pretty hefty reservation, which we may
+* not need once delalloc flushing happens.  If this is
+* the case try and do the reserve again.
+*/
+   if (flush == BTRFS_RESERVE_FLUSH_ALL)
+   __get_refill_bytes(block_rsv, &num_bytes,
+  &qgroup_num_bytes);
+   if (num_bytes == 0)
+   return 0;
+   }
+   } while (ret && last != num_bytes);
+
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, false);
trace_btrfs_space_reservation(root->fs_info, "delalloc",
@@ -5830,8 +5859,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
spin_lock(&block_rsv->lock);
block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
spin_unlock(&block_rsv->lock);
-   } else
-   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   }
return ret;
 }
 
-- 
2.14.3



[PATCH 3/8] btrfs: don't use global rsv for chunk allocation

2018-12-03 Thread Josef Bacik
The should_alloc_chunk code has math in it to decide if we're getting
short on space and if we should go ahead and pre-emptively allocate a
new chunk.  Previously when we did not have the delayed_refs_rsv, we had
to assume that the global block rsv was essentially used space and could
be allocated completely at any time, so we counted this space as "used"
when determining if we had enough slack space in our current space_info.
But on any slightly used file system (10gib or more) you can have a
global reserve of 512mib.  With our default chunk size being 1gib that
means we just assume half of the block group is used, which could result
in us allocating more metadata chunks than is actually required.

With the delayed refs rsv we can flush delayed refs as the space becomes
tight, and if we actually need more block groups then they will be
allocated based on space pressure.  We no longer require assuming the
global reserve is used space in our calculations.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 204b35434056..667b992d322d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4398,21 +4398,12 @@ static inline u64 calc_global_rsv_need_space(struct 
btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
  struct btrfs_space_info *sinfo, int force)
 {
-   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
u64 bytes_used = btrfs_space_info_used(sinfo, false);
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
return 1;
 
-   /*
-* We need to take into account the global rsv because for all intents
-* and purposes it's used space.  Don't worry about locking the
-* global_rsv, it doesn't change except when the transaction commits.
-*/
-   if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-   bytes_used += calc_global_rsv_need_space(global_rsv);
-
/*
 * in limited mode, we want to have some free space up to
 * about 1% of the FS size.
-- 
2.14.3



[PATCH 7/8] btrfs: be more explicit about allowed flush states

2018-12-03 Thread Josef Bacik
For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when
running delalloc because we may be holding a tree lock.  We can also
deadlock with delayed refs rsv's that are running via the committing
mechanism.  The only safe operations for FLUSH_LIMIT are to run the
delayed operations and to allocate chunks; everything else has the
potential to deadlock.  Future proof this by explicitly specifying the
states that FLUSH_LIMIT is allowed to use.  This will keep us from
introducing bugs later on when adding new flush states.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0e1a499035ac..ab9d915d9289 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5123,12 +5123,18 @@ void btrfs_init_async_reclaim_work(struct work_struct 
*work)
INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+   FLUSH_DELAYED_ITEMS_NR,
+   FLUSH_DELAYED_ITEMS,
+   ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
struct reserve_ticket *ticket)
 {
u64 to_reclaim;
-   int flush_state = FLUSH_DELAYED_ITEMS_NR;
+   int flush_state = 0;
 
spin_lock(&space_info->lock);
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5140,7 +5146,8 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
spin_unlock(&space_info->lock);
 
do {
-   flush_space(fs_info, space_info, to_reclaim, flush_state);
+   flush_space(fs_info, space_info, to_reclaim,
+   priority_flush_states[flush_state]);
flush_state++;
spin_lock(&space_info->lock);
if (ticket->bytes == 0) {
@@ -5148,15 +5155,7 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
return;
}
spin_unlock(&space_info->lock);
-
-   /*
-* Priority flushers can't wait on delalloc without
-* deadlocking.
-*/
-   if (flush_state == FLUSH_DELALLOC ||
-   flush_state == FLUSH_DELALLOC_WAIT)
-   flush_state = ALLOC_CHUNK;
-   } while (flush_state < COMMIT_TRANS);
+   } while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3



[PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-12-03 Thread Josef Bacik
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead see if we had a
ticket that we were able to give some reservation to, and if we were
continue the flushing loop again.  Likewise we make the tickets use the
space_info_add_old_bytes() method of returning what reservation they did
receive in hopes that it could satisfy reservations down the line.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2d0dd70570ca..0ee77a98f867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4801,6 +4801,7 @@ static void shrink_delalloc(struct btrfs_fs_info 
*fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+   u64 orig_bytes;
u64 bytes;
int error;
struct list_head list;
@@ -5023,7 +5024,7 @@ static inline int need_do_async_reclaim(struct 
btrfs_fs_info *fs_info,
!test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
struct reserve_ticket *ticket;
 
@@ -5032,7 +5033,10 @@ static void wake_all_tickets(struct list_head *head)
list_del_init(&ticket->list);
ticket->error = -ENOSPC;
wake_up(&ticket->wait);
+   if (ticket->bytes != ticket->orig_bytes)
+   return true;
}
+   return false;
 }
 
 /*
@@ -5100,8 +5104,12 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
-   wake_all_tickets(&space_info->tickets);
-   space_info->flush = 0;
+   if (wake_all_tickets(&space_info->tickets)) {
+   flush_state = FLUSH_DELAYED_ITEMS_NR;
+   commit_cycles--;
+   } else {
+   space_info->flush = 0;
+   }
} else {
flush_state = FLUSH_DELAYED_ITEMS_NR;
}
@@ -5153,10 +5161,11 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
   struct btrfs_space_info *space_info,
-  struct reserve_ticket *ticket, u64 orig_bytes)
+  struct reserve_ticket *ticket)
 
 {
DEFINE_WAIT(wait);
+   u64 reclaim_bytes = 0;
int ret = 0;
 
spin_lock(&space_info->lock);
@@ -5177,14 +5186,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info 
*fs_info,
ret = ticket->error;
if (!list_empty(&ticket->list))
list_del_init(&ticket->list);
-   if (ticket->bytes && ticket->bytes < orig_bytes) {
-   u64 num_bytes = orig_bytes - ticket->bytes;
-   update_bytes_may_use(space_info, -num_bytes);
-   trace_btrfs_space_reservation(fs_info, "space_info",
- space_info->flags, num_bytes, 0);
-   }
+   if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+   reclaim_bytes = ticket->orig_bytes - ticket->bytes;
spin_unlock(&space_info->lock);
 
+   if (reclaim_bytes)
+   space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
return ret;
 }
 
@@ -5210,6 +5217,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 {
struct reserve_ticket ticket;
u64 used;
+   u64 reclaim_bytes = 0;
int ret = 0;
 
ASSERT(orig_bytes);
@@ -5245,6 +5253,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 * the list and we will do our own flushing further down.
 */
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+   ticket.orig_bytes = orig_bytes;
ticket.bytes = orig_bytes;
ticket.error = 0;
init_waitqueue_head(&ticket.wait);
@@ -5285,25 +5294,21 @@ static int __reserve_metadata_bytes(struct 
btrfs_fs_info *fs_info,
return ret;
 
if (flush == BTRFS_RESERVE_FLUSH_ALL)
-   return wait_reserve_ticket(fs_info, sp

[PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-03 Thread Josef Bacik
Now with the delayed_refs_rsv we can know exactly how much pending
delayed refs space we need.  This means we can drastically simplify
btrfs_check_space_for_delayed_refs by simply checking how much space we
have reserved for the global rsv (which acts as a spill-over buffer) and
the delayed refs rsv.  If our total size is beyond that amount then we
know it's time to commit the transaction and stop any more delayed refs
from being generated.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/extent-tree.c | 48 ++--
 fs/btrfs/inode.c   |  4 ++--
 fs/btrfs/transaction.c |  2 +-
 4 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2eba398c722b..30da075c042e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
btrfs_fs_info *fs_info,
 }
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
 void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 const u64 start);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5a2d0b061f57..07ef1b8087f7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info 
*fs_info, u64 csum_bytes)
return num_csums;
 }
 
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_block_rsv *global_rsv;
-   u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
-   u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
-   unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
-   u64 num_bytes, num_dirty_bgs_bytes;
-   int ret = 0;
+   struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+   bool ret = false;
+   u64 reserved;
 
-   num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
-   num_heads = heads_to_leaves(fs_info, num_heads);
-   if (num_heads > 1)
-   num_bytes += (num_heads - 1) * fs_info->nodesize;
-   num_bytes <<= 1;
-   num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
-   fs_info->nodesize;
-   num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
-num_dirty_bgs);
-   global_rsv = &fs_info->global_block_rsv;
+   spin_lock(&global_rsv->lock);
+   reserved = global_rsv->reserved;
+   spin_unlock(&global_rsv->lock);
 
/*
-* If we can't allocate any more chunks lets make sure we have _lots_ of
-* wiggle room since running delayed refs can create more delayed refs.
+* Since the global reserve is just kind of magic we don't really want
+* to rely on it to save our bacon, so if our size is more than the
+* delayed_refs_rsv and the global rsv then it's time to think about
+* bailing.
 */
-   if (global_rsv->space_info->full) {
-   num_dirty_bgs_bytes <<= 1;
-   num_bytes <<= 1;
-   }
-
-   spin_lock(&global_rsv->lock);
-   if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
-   ret = 1;
-   spin_unlock(&global_rsv->lock);
+   spin_lock(&delayed_refs_rsv->lock);
+   reserved += delayed_refs_rsv->reserved;
+   if (delayed_refs_rsv->size >= reserved)
+   ret = true;
+   spin_unlock(&delayed_refs_rsv->lock);
return ret;
 }
 
@@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct 
btrfs_trans_handle *trans)
if (val >= NSEC_PER_SEC / 2)
return 2;
 
-   return btrfs_check_space_for_delayed_refs(trans);
+   return btrfs_check_space_for_delayed_refs(trans->fs_info);
 }
 
 struct async_delayed_refs {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a097f5fde31d..8532a2eb56d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
 * Try to steal from the global reserve if there is space for
 * it.
 */
-   if (!btrfs_check_space_for_delayed_refs(trans) &&
-   !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, false))
+   

[PATCH 10/10] btrfs: fix truncate throttling

2018-12-03 Thread Josef Bacik
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 79 
 1 file changed, 17 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8532a2eb56d1..623a71d871d4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   u64 bytes_deleted)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   int ret;
-
-   /*
-* This is only used to apply pressure to the enospc system, we don't
-* intend to use this reservation at all.
-*/
-   bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-   bytes_deleted *= fs_info->nodesize;
-   ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
- bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-   if (!ret) {
-   trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid,
- bytes_deleted, 1);
-   trans->bytes_reserved += bytes_deleted;
-   }
-   return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
-   bool should_end = false;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
btrfs_abort_transaction(trans, ret);
break;
}
-   if (btrfs_should_throttle_delayed_refs(trans))
-   btrfs_async_run_delayed_refs(fs_info,
-   trans->delayed_ref_updates * 2,
-   trans->transid, 0);
if (be_nice) {
-   if (truncate_space_check(trans, root,
-extent_num_bytes)) {
-   should_end = true;
-   }
if (btrfs_should_throttle_delayed_refs(trans))
should_throttle = true;
}
@@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
if (path->slots[0] == 0 ||
path->slots[0] != pending_del_slot ||
-   should_throttle || should_end) {
+   should_throttle) {
if (pending_del_nr) {
ret = btrfs_del_items(trans, root, path,
pending_del_slot,
@@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct 
btrfs_trans_handle *trans,
pending_del_nr = 0;
}
btrfs_release_path(path);
-   if (should_throttle) {
-   unsigned long updates = 
trans->delayed_ref_updates;
-   if (updates) {
-   trans->delayed_ref_updates = 0;
-   ret = btrfs_run_delayed_refs(trans,
-  updates * 2);
-   if (ret)
-   break;
-   }
-   }
+
/*
-* if we failed to refill our space rsv, bail out
-* and let the transaction restart
+* We can generate a lot of delayed refs, so we need to
+* throttle every once and a while and make sure we're
+* adding enough space to keep up with the work we are
+* generating.  Since we hold a transaction here we
+* can't flush, and we don't want to FLUSH_LIMIT because
+* we could have generated too many delayed refs to
+* actually allocate, so just bail if we're short and
+   

[PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic

2018-12-03 Thread Josef Bacik
Over the years we have built up a lot of infrastructure to keep delayed
refs in check, mostly by running them at btrfs_end_transaction() time.
We have a lot of different math we do to figure out how much to run, whether
we should do it inline or async, etc.  This existed because we had no
feedback mechanism to force the flushing of delayed refs when they
became a problem.  However with the enospc flushing infrastructure in
place for flushing delayed refs when they put too much pressure on the
enospc system we have this problem solved.  Rip out all of this code as
it is no longer needed.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 38 --
 1 file changed, 38 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2d8401bf8df9..01f39401619a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -798,22 +798,12 @@ static int should_end_transaction(struct 
btrfs_trans_handle *trans)
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
struct btrfs_transaction *cur_trans = trans->transaction;
-   int updates;
-   int err;
 
smp_mb();
if (cur_trans->state >= TRANS_STATE_BLOCKED ||
cur_trans->delayed_refs.flushing)
return 1;
 
-   updates = trans->delayed_ref_updates;
-   trans->delayed_ref_updates = 0;
-   if (updates) {
-   err = btrfs_run_delayed_refs(trans, updates * 2);
-   if (err) /* Error code will also eval true */
-   return err;
-   }
-
return should_end_transaction(trans);
 }
 
@@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
 {
struct btrfs_fs_info *info = trans->fs_info;
struct btrfs_transaction *cur_trans = trans->transaction;
-   u64 transid = trans->transid;
-   unsigned long cur = trans->delayed_ref_updates;
int lock = (trans->type != TRANS_JOIN_NOLOCK);
int err = 0;
-   int must_run_delayed_refs = 0;
 
if (refcount_read(&trans->use_count) > 1) {
refcount_dec(&trans->use_count);
@@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
-
-   trans->delayed_ref_updates = 0;
-   if (!trans->sync) {
-   must_run_delayed_refs =
-   btrfs_should_throttle_delayed_refs(trans);
-   cur = max_t(unsigned long, cur, 32);
-
-   /*
-* don't make the caller wait if they are from a NOLOCK
-* or ATTACH transaction, it will deadlock with commit
-*/
-   if (must_run_delayed_refs == 1 &&
-   (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
-   must_run_delayed_refs = 2;
-   }
-
-   btrfs_trans_release_metadata(trans);
-   trans->block_rsv = NULL;
-
if (!list_empty(&trans->new_bgs))
btrfs_create_pending_block_groups(trans);
 
@@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
}
 
kmem_cache_free(btrfs_trans_handle_cachep, trans);
-   if (must_run_delayed_refs) {
-   btrfs_async_run_delayed_refs(info, cur, transid,
-must_run_delayed_refs == 1);
-   }
return err;
 }
 
-- 
2.14.3



[PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper

2018-12-03 Thread Josef Bacik
From: Josef Bacik 

We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval 
Reviewed-by: Liu Bo 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 67 +-
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c36b3a42f2bb..e3ed3507018d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
*trans,
return ret ? ret : 1;
 }
 
+static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_delayed_ref_root *delayed_refs =
+   &trans->transaction->delayed_refs;
+
+   if (head->total_ref_mod < 0) {
+   struct btrfs_space_info *space_info;
+   u64 flags;
+
+   if (head->is_data)
+   flags = BTRFS_BLOCK_GROUP_DATA;
+   else if (head->is_system)
+   flags = BTRFS_BLOCK_GROUP_SYSTEM;
+   else
+   flags = BTRFS_BLOCK_GROUP_METADATA;
+   space_info = __find_space_info(fs_info, flags);
+   ASSERT(space_info);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+  -head->num_bytes,
+  BTRFS_TOTAL_BYTES_PINNED_BATCH);
+
+   if (head->is_data) {
+   spin_lock(&delayed_refs->lock);
+   delayed_refs->pending_csums -= head->num_bytes;
+   spin_unlock(&delayed_refs->lock);
+   }
+   }
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+ head->qgroup_reserved);
+}
+
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *head)
 {
@@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 
-   trace_run_delayed_ref_head(fs_info, head, 0);
-
-   if (head->total_ref_mod < 0) {
-   struct btrfs_space_info *space_info;
-   u64 flags;
-
-   if (head->is_data)
-   flags = BTRFS_BLOCK_GROUP_DATA;
-   else if (head->is_system)
-   flags = BTRFS_BLOCK_GROUP_SYSTEM;
-   else
-   flags = BTRFS_BLOCK_GROUP_METADATA;
-   space_info = __find_space_info(fs_info, flags);
-   ASSERT(space_info);
-   percpu_counter_add_batch(&space_info->total_bytes_pinned,
-  -head->num_bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH);
-
-   if (head->is_data) {
-   spin_lock(&delayed_refs->lock);
-   delayed_refs->pending_csums -= head->num_bytes;
-   spin_unlock(&delayed_refs->lock);
-   }
-   }
-
if (head->must_insert_reserved) {
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
@@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   /* Also free its reserved qgroup space */
-   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
- head->qgroup_reserved);
+   cleanup_ref_head_accounting(trans, head);
+
+   trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
btrfs_put_delayed_ref_head(head);
return 0;
@@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
+   cleanup_ref_head_accounting(trans, head);
mutex_unlock(&head->mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 00/10][V2] Delayed refs rsv

2018-12-03 Thread Josef Bacik
v1->v2:
- addressed the comments from the various reviewers.
- split "introduce delayed_refs_rsv" into 5 patches.  The patches are the same
  together as they were, just split out more logically.  They can't really be
  bisected across in that you will likely have fun enospc failures, but they
  compile properly.  This was done to make it easier to review.

-- Original message --

This patchset changes how we do space reservations for delayed refs.  We were
hitting probably 20-40 enospc aborts per day in production while running
delayed refs at transaction commit time.  This means we ran out of space in the
global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if
we're running short on space and when it's time to force a commit.  A failure
rate of 20-40 file systems when we run hundreds of thousands of them isn't super
high, but cleaning up this code will make things less ugly and more predictable.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the global
reserve for that spillover, which fits better into its desired use than a full
blown reservation.  This first approach is to simply take the number of items
we're reserving space for and multiply that by 2 in order to save enough space
for the delayed refs that could be generated.  This is a naive approach and will
probably evolve, but for now it works.

With this patchset we've gone down to 2-8 failures per week.  It's not perfect,
there are some corner cases that still need to be addressed, but it is
significantly better than what we had.  Thanks,

Josef
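
As a rough, userspace-only sketch of the sizing idea above -- the constants
and helper names are made up for illustration, this is not the kernel's
metadata-size math:

/* Illustrative only: model "reserve 2x the per-item worst case". */
#include <stdint.h>
#include <stdio.h>

#define NODESIZE	16384ULL	/* assumed metadata node size */
#define MAX_LEVEL	8ULL		/* assumed worst-case btree height */

/* Worst case: each item may dirty a full path of tree blocks. */
static uint64_t calc_items_size(uint64_t num_items)
{
	return num_items * NODESIZE * MAX_LEVEL;
}

/*
 * The delayed refs rsv sizing described above: double the per-item
 * reservation so the refs generated while running the original ones
 * are covered as well.
 */
static uint64_t calc_delayed_refs_rsv(uint64_t num_items)
{
	return 2 * calc_items_size(num_items);
}

int main(void)
{
	printf("1 item   -> %llu bytes\n",
	       (unsigned long long)calc_delayed_refs_rsv(1));
	printf("16 items -> %llu bytes\n",
	       (unsigned long long)calc_delayed_refs_rsv(16));
	return 0;
}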


[PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv

2018-12-03 Thread Josef Bacik
Any space used in the delayed_refs_rsv will be freed up by a transaction
commit, so instead of just counting the pinned space we also need to
account for any space in the delayed_refs_rsv when deciding whether it will
make a difference to commit the transaction to satisfy our space
reservation.  If we have enough bytes to satisfy our reservation ticket
then we are good to go, otherwise subtract out what space we would gain
back by committing the transaction and compare that against the pinned
space to make our decision.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index aa0a638d0263..63ff9d832867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 {
struct reserve_ticket *ticket = NULL;
struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
+   struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
struct btrfs_trans_handle *trans;
-   u64 bytes;
+   u64 bytes_needed;
+   u64 reclaim_bytes = 0;
 
trans = (struct btrfs_trans_handle *)current->journal_info;
if (trans)
@@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
else if (!list_empty(&space_info->tickets))
ticket = list_first_entry(&space_info->tickets,
  struct reserve_ticket, list);
-   bytes = (ticket) ? ticket->bytes : 0;
+   bytes_needed = (ticket) ? ticket->bytes : 0;
spin_unlock(&space_info->lock);
 
-   if (!bytes)
+   if (!bytes_needed)
return 0;
 
/* See if there is enough pinned space to make this reservation */
if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes,
+  bytes_needed,
   BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
goto commit;
 
@@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
return -ENOSPC;
 
spin_lock(&delayed_rsv->lock);
-   if (delayed_rsv->size > bytes)
-   bytes = 0;
-   else
-   bytes -= delayed_rsv->size;
+   reclaim_bytes += delayed_rsv->reserved;
spin_unlock(&delayed_rsv->lock);
 
+   spin_lock(&delayed_refs_rsv->lock);
+   reclaim_bytes += delayed_refs_rsv->reserved;
+   spin_unlock(&delayed_refs_rsv->lock);
+   if (reclaim_bytes >= bytes_needed)
+   goto commit;
+   bytes_needed -= reclaim_bytes;
+
if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes,
+  bytes_needed,
   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
return -ENOSPC;
}
-- 
2.14.3
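
A compressed userspace sketch of the decision described in the changelog,
with plain integers standing in for the percpu pinned counter and the
block_rsv "reserved" fields (illustrative only, not the kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool may_commit(uint64_t bytes_needed, uint64_t pinned,
		       uint64_t delayed_rsv_reserved,
		       uint64_t delayed_refs_rsv_reserved)
{
	uint64_t reclaim_bytes;

	if (!bytes_needed)
		return false;

	/* Enough pinned space by itself?  Commit. */
	if (pinned >= bytes_needed)
		return true;

	/* Otherwise count what a commit would hand back from both rsvs. */
	reclaim_bytes = delayed_rsv_reserved + delayed_refs_rsv_reserved;
	if (reclaim_bytes >= bytes_needed)
		return true;

	/* And check whether pinned space covers the remainder. */
	return pinned >= bytes_needed - reclaim_bytes;
}

int main(void)
{
	/* 1M ticket, 256K pinned, 512K in each rsv -> worth committing. */
	printf("commit: %d\n", may_commit(1ULL << 20, 256ULL << 10,
					  512ULL << 10, 512ULL << 10));
	return 0;
}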



[PATCH 05/10] btrfs: introduce delayed_refs_rsv

2018-12-03 Thread Josef Bacik
From: Josef Bacik 

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if it's a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  14 +++-
 fs/btrfs/delayed-ref.c |  43 --
 fs/btrfs/disk-io.c |   4 +
 fs/btrfs/extent-tree.c | 212 +
 fs/btrfs/transaction.c |  37 -
 5 files changed, 284 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b41ec42f405..52a87d446945 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -448,8 +448,9 @@ struct btrfs_space_info {
 #define BTRFS_BLOCK_RSV_TRANS   3
 #define BTRFS_BLOCK_RSV_CHUNK   4
 #define BTRFS_BLOCK_RSV_DELOPS  5
-#define BTRFS_BLOCK_RSV_EMPTY   6
-#define BTRFS_BLOCK_RSV_TEMP    7
+#define BTRFS_BLOCK_RSV_DELREFS 6
+#define BTRFS_BLOCK_RSV_EMPTY   7
+#define BTRFS_BLOCK_RSV_TEMP    8
 
 struct btrfs_block_rsv {
u64 size;
@@ -812,6 +813,8 @@ struct btrfs_fs_info {
struct btrfs_block_rsv chunk_block_rsv;
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv;
+   /* block reservation for delayed refs */
+   struct btrfs_block_rsv delayed_refs_rsv;
 
struct btrfs_block_rsv empty_block_rsv;
 
@@ -2796,6 +2799,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 struct btrfs_block_rsv *block_rsv,
 u64 num_bytes);
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
+int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
+ enum btrfs_reserve_flush_enum flush);
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+  struct btrfs_block_rsv *src,
+  u64 num_bytes);
 int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 48725fa757a3..a198ab91879c 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -474,11 +474,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
  * existing and update must have the same bytenr
  */
 static noinline void
-update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
+update_existing_head_ref(struct btrfs_trans_handle *trans,
 struct btrfs_delayed_ref_head *existing,
 struct btrfs_delayed_ref_head *update,
 int *old_ref_mod_ret)
 {
+   struct btrfs_delayed_ref_root *delayed_refs =
+   &trans->transaction->delayed_refs;
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod;
 
BUG_ON(existing->is_data != update->is_data);
@@ -536,10 +539,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
 * versa we need to make sure to adjust pending_csums accordingly.
 */
if (existing->is_data) {
-   if (existing->total_ref_mod >= 0 && old_ref_mod < 0)
+   u64 csum_leaves =
+   btrfs_csum_bytes_to_leaves(fs_info,
+  existing->num_bytes);
+
+   if (existing->total_ref_mod >= 0 && old_ref_mod < 0) {
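
As a loose, userspace-only illustration of the bookkeeping a dedicated
reservation buys us (the names here are invented for the sketch, not the
btrfs_block_rsv API):

#include <stdint.h>
#include <stdio.h>

struct toy_rsv {
	uint64_t size;		/* how much we want to have reserved */
	uint64_t reserved;	/* how much we actually hold */
};

/* Queueing N ref heads grows what the rsv should hold. */
static void toy_rsv_add_refs(struct toy_rsv *rsv, unsigned int nr,
			     uint64_t bytes_per_ref)
{
	rsv->size += (uint64_t)nr * bytes_per_ref;
}

/* Running N heads shrinks the target and frees any excess we hold. */
static uint64_t toy_rsv_release(struct toy_rsv *rsv, unsigned int nr,
				uint64_t bytes_per_ref)
{
	uint64_t bytes = (uint64_t)nr * bytes_per_ref;

	if (bytes > rsv->size)
		bytes = rsv->size;
	rsv->size -= bytes;
	if (rsv->reserved > rsv->size) {
		uint64_t freed = rsv->reserved - rsv->size;

		rsv->reserved = rsv->size;
		return freed;	/* hand the excess back to the space pool */
	}
	return 0;
}

int main(void)
{
	struct toy_rsv rsv = { 0, 0 };

	toy_rsv_add_refs(&rsv, 4, 16384);
	rsv.reserved = rsv.size;	/* pretend flushing filled it up */
	printf("freed %llu bytes\n",
	       (unsigned long long)toy_rsv_release(&rsv, 2, 16384));
	return 0;
}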
   

[PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates

2018-12-03 Thread Josef Bacik
From: Josef Bacik 

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b3e4c9fcb664..48725fa757a3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
ref->in_tree = 0;
btrfs_put_delayed_ref(ref);
atomic_dec(&delayed_refs->num_entries);
-   if (trans->delayed_ref_updates)
-   trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
if (ref->action == BTRFS_ADD_DELAYED_REF)
list_add_tail(&ref->add_list, &href->ref_add_list);
atomic_inc(&root->num_entries);
-   trans->delayed_ref_updates++;
spin_unlock(&href->lock);
return ret;
 }
-- 
2.14.3



[PATCH 03/10] btrfs: cleanup extent_op handling

2018-12-03 Thread Josef Bacik
From: Josef Bacik 

The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer.  Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Reviewed-by: Lu Fengqi 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e3ed3507018d..9f169f3c5fdb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2424,19 +2424,32 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
btrfs_delayed_ref_unlock(head);
 }
 
-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head)
+static struct btrfs_delayed_extent_op *cleanup_extent_op(
+   struct btrfs_delayed_ref_head *head)
 {
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
-   int ret;
 
if (!extent_op)
-   return 0;
-   head->extent_op = NULL;
+   return NULL;
+
if (head->must_insert_reserved) {
+   head->extent_op = NULL;
btrfs_free_delayed_extent_op(extent_op);
-   return 0;
+   return NULL;
}
+   return extent_op;
+}
+
+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
+struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_delayed_extent_op *extent_op;
+   int ret;
+
+   extent_op = cleanup_extent_op(head);
+   if (!extent_op)
+   return 0;
+   head->extent_op = NULL;
spin_unlock(&head->lock);
ret = run_delayed_extent_op(trans, head, extent_op);
btrfs_free_delayed_extent_op(extent_op);
@@ -2488,7 +2501,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 
delayed_refs = &trans->transaction->delayed_refs;
 
-   ret = cleanup_extent_op(trans, head);
+   ret = run_and_cleanup_extent_op(trans, head);
if (ret < 0) {
unselect_delayed_ref_head(delayed_refs, head);
btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6990,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root))
goto out;
 
-   if (head->extent_op) {
-   if (!head->must_insert_reserved)
-   goto out;
-   btrfs_free_delayed_extent_op(head->extent_op);
-   head->extent_op = NULL;
-   }
+   if (cleanup_extent_op(head) != NULL)
+   goto out;
 
/*
 * waiting for the lock here would deadlock.  If someone else has it
-- 
2.14.3



[PATCH 01/10] btrfs: add btrfs_delete_ref_head helper

2018-12-03 Thread Josef Bacik
From: Josef Bacik 

We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/delayed-ref.c | 14 ++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 22 +++---
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..b3e4c9fcb664 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head)
+{
+   lockdep_assert_held(&delayed_refs->lock);
+   lockdep_assert_held(&head->lock);
+
+   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
+   RB_CLEAR_NODE(&head->href_node);
+   atomic_dec(&delayed_refs->num_entries);
+   delayed_refs->num_heads--;
+   if (head->processing == 0)
+   delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..d2af974f68a1 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 {
mutex_unlock(&head->mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *btrfs_select_ref_head(
struct btrfs_delayed_ref_root *delayed_refs);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..c36b3a42f2bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
spin_unlock(&delayed_refs->lock);
return 1;
}
-   delayed_refs->num_heads--;
-   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
-   atomic_dec(&delayed_refs->num_entries);
 
trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
if (!mutex_trylock(&head->mutex))
goto out;
 
-   /*
-* at this point we have a head with no other entries.  Go
-* ahead and process it.
-*/
-   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
-   atomic_dec(&delayed_refs->num_entries);
-
-   /*
-* we don't take a ref on the node because we're removing it from the
-* tree, so we just steal the ref the tree was holding.
-*/
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
+   btrfs_delete_ref_head(delayed_refs, head);
head->processing = 0;
+
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 
-- 
2.14.3



[PATCH 07/10] btrfs: add new flushing states for the delayed refs rsv

2018-12-03 Thread Josef Bacik
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure.  Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h | 10 ++
 fs/btrfs/extent-tree.c   | 14 ++
 include/trace/events/btrfs.h |  2 ++
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 52a87d446945..2eba398c722b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2745,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR  =   1,
FLUSH_DELAYED_ITEMS =   2,
-   FLUSH_DELALLOC  =   3,
-   FLUSH_DELALLOC_WAIT =   4,
-   ALLOC_CHUNK =   5,
-   COMMIT_TRANS=   6,
+   FLUSH_DELAYED_REFS_NR   =   3,
+   FLUSH_DELAYED_REFS  =   4,
+   FLUSH_DELALLOC  =   5,
+   FLUSH_DELALLOC_WAIT =   6,
+   ALLOC_CHUNK =   7,
+   COMMIT_TRANS=   8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 63ff9d832867..5a2d0b061f57 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4938,6 +4938,20 @@ static void flush_space(struct btrfs_fs_info *fs_info,
shrink_delalloc(fs_info, num_bytes * 2, num_bytes,
state == FLUSH_DELALLOC_WAIT);
break;
+   case FLUSH_DELAYED_REFS_NR:
+   case FLUSH_DELAYED_REFS:
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   break;
+   }
+   if (state == FLUSH_DELAYED_REFS_NR)
+   nr = calc_reclaim_items_nr(fs_info, num_bytes);
+   else
+   nr = 0;
+   btrfs_run_delayed_refs(trans, nr);
+   btrfs_end_transaction(trans);
+   break;
case ALLOC_CHUNK:
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8568946f491d..63d1f9d8b8c7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1048,6 +1048,8 @@ TRACE_EVENT(btrfs_trigger_flush,
{ FLUSH_DELAYED_ITEMS,  "FLUSH_DELAYED_ITEMS"},         \
{ FLUSH_DELALLOC,       "FLUSH_DELALLOC"},              \
{ FLUSH_DELALLOC_WAIT,  "FLUSH_DELALLOC_WAIT"},         \
+   { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},       \
+   { FLUSH_DELAYED_REFS,   "FLUSH_DELAYED_REFS"},          \
{ ALLOC_CHUNK,          "ALLOC_CHUNK"},                 \
{ COMMIT_TRANS,         "COMMIT_TRANS"})
 
-- 
2.14.3



Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Josef Bacik
On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote:
> On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> >
> > From: Josef Bacik 
> >
> > When debugging some weird extent reference bug I suspected that we were
> > changing a snapshot while we were deleting it, which could explain my
> > bug.  This was indeed what was happening, and this patch helped me
> > verify my theory.  It is never correct to modify the snapshot once it's
> > being deleted, so mark the root when we are deleting it and make sure we
> > complain about it when it happens.
> >
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.c   | 3 +++
> >  fs/btrfs/ctree.h   | 1 +
> >  fs/btrfs/extent-tree.c | 9 +
> >  3 files changed, 13 insertions(+)
> >
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index 5912a97b07a6..5f82f86085e8 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
> > u64 search_start;
> > int ret;
> >
> > +   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> > +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n");
> 
> Please use btrfs_warn(), it makes sure we use a consistent message
> style, identifies the fs, etc.
> Also, "thats" should be "that is" or "that's".
> 

Ah yeah, I was following the other convention in there but we should probably
convert all of those to btrfs_warn.  I'll fix the grammar thing as well, just a
leftover from the much less code of conduct friendly message I originally had
there.  Thanks,

Josef


[PATCH 0/2] Fix aborts when dropping snapshots

2018-11-30 Thread Josef Bacik
With more machines running my recent delayed ref rsv patches we started seeing a
spike in boxes aborting in __btrfs_free_extent with being unable to find the
extent ref.  The full details are in 2/2, but the summary is there's been a bug
ever since the original delayed inode item stuff was introduced where it could
run once the snapshot was being deleted, which will result in all sorts of
extent reference shenanigans.  This was tricky to hit before, but with my iput
changes it's become much easier to hit on our build boxes that are heavy users
of snapshot creation/deletion.  Thanks,

Josef


[PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-11-30 Thread Josef Bacik
From: Josef Bacik 

When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug.  This was indeed what was happening, and this patch helped me
verify my theory.  It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.c   | 3 +++
 fs/btrfs/ctree.h   | 1 +
 fs/btrfs/extent-tree.c | 9 +
 3 files changed, 13 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 5912a97b07a6..5f82f86085e8 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
u64 search_start;
int ret;
 
+   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
+   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being dropped\n");
+
if (trans->transaction != fs_info->running_transaction)
WARN(1, KERN_CRIT "trans %llu running %llu\n",
   trans->transid,
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index facde70c15ed..5a3a94ccb65c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1199,6 +1199,7 @@ enum {
BTRFS_ROOT_FORCE_COW,
BTRFS_ROOT_MULTI_LOG_TASKS,
BTRFS_ROOT_DIRTY,
+   BTRFS_ROOT_DELETING,
 };
 
 /*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 581c2a0b2945..dcb699dd57f3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9333,6 +9333,15 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
if (block_rsv)
trans->block_rsv = block_rsv;
 
+   /*
+* This will help us catch people modifying the fs tree while we're
+* dropping it.  It is unsafe to mess with the fs tree while it's being
+* dropped as we unlock the root node and parent nodes as we walk down
+* the tree, assuming nothing will change.  If something does change
+* then we'll have stale information and drop references to blocks we've
+* already dropped.
+*/
+   set_bit(BTRFS_ROOT_DELETING, &root->state);
if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
level = btrfs_header_level(root->node);
path->nodes[level] = btrfs_lock_root_node(root);
-- 
2.14.3



[PATCH 2/2] btrfs: run delayed items before dropping the snapshot

2018-11-30 Thread Josef Bacik
From: Josef Bacik 

With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
35964  owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified
while it's dropping the snapshot.  It will unlock the root node (and really
every node) as it walks down the tree, only to re-lock it when it needs to do
something.  This is a problem because if we modify the tree we could cow a block
in our path, which frees our reference to that block.  Then once we get back to
that shared block we'll free our reference to it again, and get ENOENT when
trying to lookup our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed
for our deleted snapshot _after_ all of the inodes are closed for the snapshot.
We only run the delayed inode item if we're deleting the inode, and even then we
do not run the delayed insertions or delayed removals.  These can be run at any
point after our final inode does its last iput, which is what triggers the
snapshot deletion.  We can end up with the snapshot deletion happening and then
have the delayed items run on that file system, resulting in the above problem.

This problem has existed forever, however my patches made it much easier to hit
as I wake up the cleaner much more often to deal with delayed iputs, which made
us more likely to start the snapshot dropping work before the transaction
commits, which is when the delayed items would generally be run.  Before,
generally speaking, we would run the delayed items, commit the transaction, and
wakeup the cleaner thread to start deleting snapshots, which means we were less
likely to hit this problem.  You could still hit it if you had multiple
snapshots to be deleted and ended up with lots of delayed items, but it was
definitely harder.

Fix for now by simply running all the delayed items before starting to drop the
snapshot.  We could make this smarter in the future by making the delayed items
per-root, and then simply drop any delayed items for roots that we are going to
delete.  But for now just a quick and easy solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..965702034b22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
goto out_free;
}
 
+   btrfs_run_delayed_items(trans);
+
if (block_rsv)
trans->block_rsv = block_rsv;
 
-- 
2.14.3



Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 07:59:42PM +, Chris Mason wrote:
> On 27 Nov 2018, at 14:54, Josef Bacik wrote:
> 
> > On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> >>> The cleaner thread usually takes care of delayed iputs, with the
> >>> exception of the btrfs_end_transaction_throttle path.  The cleaner
> >>> thread only gets woken up every 30 seconds, so instead wake it up to 
> >>> do
> >>> it's work so that we can free up that space as quickly as possible.
> >>
> >> Have you done any measurements how this affects the overall system. I
> >> suspect this introduces a lot of noise since now we are going to be
> >> doing a thread wakeup on every iput, does this give a chance to have
> >> nice, large batches of iputs that  the cost of wake up can be 
> >> amortized
> >> across?
> >
> > I ran the whole patchset with our A/B testing stuff and the patchset 
> > was a 5%
> > win overall, so I'm inclined to think it's fine.  Thanks,
> 
> It's a good point though, a delayed wakeup may be less overhead.

Sure, but how do we go about that without it sometimes messing up?  In practice
the only time we're doing this is at the end of finish_ordered_io, so likely to
not be a constant stream.  I suppose since we have places where we force it to
run that we don't really need this.  IDK, I'm fine with dropping it.  Thanks,

Josef


Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 10:29:57AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> 
> Which one is the throttle path? btrfs_end_transaction_throttle is only
> called during snapshot drop and relocation? What scenario triggered your
> observation and prompted this fix?
> 

One of my enospc tests runs snapshot creation/deletion in the background.

> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h   |  4 +++-
> >  fs/btrfs/disk-io.c |  5 ++---
> >  fs/btrfs/extent-tree.c |  9 +
> >  fs/btrfs/inode.c   | 21 +
> >  4 files changed, 31 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 709de7471d86..a835fe7076eb 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -912,7 +912,8 @@ struct btrfs_fs_info {
> >  
> > spinlock_t delayed_iput_lock;
> > struct list_head delayed_iputs;
> > -   struct mutex cleaner_delayed_iput_mutex;
> > +   atomic_t nr_delayed_iputs;
> > +   wait_queue_head_t delayed_iputs_wait;
> >  
> > /* this protects tree_mod_seq_list */
> > spinlock_t tree_mod_seq_lock;
> > @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
> >  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
> >  void btrfs_add_delayed_iput(struct inode *inode);
> >  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
> > +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
> >  int btrfs_prealloc_file_range(struct inode *inode, int mode,
> >   u64 start, u64 num_bytes, u64 min_size,
> >   loff_t actual_len, u64 *alloc_hint);
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index c5918ff8241b..3f81dfaefa32 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
> > goto sleep;
> > }
> >  
> > -   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> > btrfs_run_delayed_iputs(fs_info);
> > -   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> >  
> > again = btrfs_clean_one_deleted_snapshot(root);
> > mutex_unlock(_info->cleaner_mutex);
> > @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
> > mutex_init(&fs_info->delete_unused_bgs_mutex);
> > mutex_init(&fs_info->reloc_mutex);
> > mutex_init(&fs_info->delalloc_root_mutex);
> > -   mutex_init(&fs_info->cleaner_delayed_iput_mutex);
> > seqlock_init(&fs_info->profiles_lock);
> >  
> > INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
> > @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
> > atomic_set(&fs_info->defrag_running, 0);
> > atomic_set(&fs_info->qgroup_op_seq, 0);
> > atomic_set(&fs_info->reada_works_cnt, 0);
> > +   atomic_set(&fs_info->nr_delayed_iputs, 0);
> > atomic64_set(&fs_info->tree_mod_seq, 0);
> > fs_info->sb = sb;
> > fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
> > @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
> > init_waitqueue_head(&fs_info->transaction_wait);
> > init_waitqueue_head(&fs_info->transaction_blocked_wait);
> > init_waitqueue_head(&fs_info->async_submit_wait);
> > +   init_waitqueue_head(&fs_info->delayed_iputs_wait);
> >  
> > INIT_LIST_HEAD(&fs_info->pinned_chunks);
> >  
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 3a90dc1d6b31..36f43876be22 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
> >  * operations. Wait for it to finish so that
> >  * more space is released.
> >  

Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 07:43:39PM +, Filipe Manana wrote:
> On Tue, Nov 27, 2018 at 7:22 PM Josef Bacik  wrote:
> >
> > On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote:
> > > On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
> > > >
> > > > I noticed in a giant dbench run that we spent a lot of time on lock
> > > > contention while running transaction commit.  This is because dbench
> > > > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > > > they all run the delayed refs first thing, so they all contend with
> > > > each other.  This leads to seconds of 0 throughput.  Change this to only
> > > > run the delayed refs if we're the ones committing the transaction.  This
> > > > makes the latency go away and we get no more lock contention.
> > >
> > > Can you share the following in the changelog please?
> > >
> > > 1) How did you ran dbench (parameters, config).
> > >
> > > 2) What results did you get before and after this change. So that we all 
> > > get
> > > an idea of how good the impact is.
> > >
> > > While the reduced contention makes all sense and seems valid, I'm not
> > > sure this is always a win.
> > > It certainly is when multiple tasks are calling
> > > btrfs_commit_transaction() simultaneously, but,
> > > what about when only one does it?
> > >
> > > By running all delayed references inside the critical section of the
> > > transaction commit
> > > (state == TRANS_STATE_COMMIT_START), instead of running most of them
> > > outside/before,
> > > we will be blocking for a longer a time other tasks calling
> > > btrfs_start_transaction() (which is used
> > > a lot - creating files, unlinking files, adding links, etc, and even 
> > > fsync).
> > >
> > > Won't there by any other types of workload and tests other then dbench
> > > that can get increased
> > > latency and/or smaller throughput?
> > >
> > > I find that sort of information useful to have in the changelog. If
> > > you verified that or you think
> > > it's irrelevant to measure/consider, it would be great to have it
> > > mentioned in the changelog
> > > (and explained).
> > >
> >
> > Yeah I thought about the delayed refs being run in the critical section now,
> > that's not awesome.  I'll drop this for now, I think just having a mutex 
> > around
> > running delayed refs will be good enough, since we want people who care 
> > about
> > flushing delayed refs to wait around for that to finish happening.  Thanks,
> 
> Well, I think we can have a solution that doesn't bring such trade-off
> nor introducing a mutex.
> We could do like what is currently done for writing space caches, to
> make sure only the first task
> calling commit transaction does the work and all others do nothing
> except waiting for the commit to finish:
> 
> btrfs_commit_transaction()
>if (!test_and_set_bit(BTRFS_TRANS_COMMIT_START, _trans->flags)) {
>run delayed refs before entering critical section
>}
> 

That was my first inclination but then one of the other committers goes past
this and gets into the start TRANS_STATE_COMMIT_START place and has to wait for
the guy running the delayed refs to finish before moving forward, essentially
extending the time in the critical section for the same reason.  The mutex means
that if 9000 people try to commit the transaction, 1 guy gets to run the delayed
refs and everybody else waits for him to finish, and in the meantime the
critical section is small.  Thanks,

Josef
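
To make the trade-off concrete, here is a toy pthreads model of the
"one runner, everybody else waits" behaviour a mutex gives us; it is
purely illustrative and not btrfs code:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t delayed_refs_lock = PTHREAD_MUTEX_INITIALIZER;
static int pending_refs = 1000;

static void *committer(void *arg)
{
	int ran = 0;

	(void)arg;
	/* Whoever gets the lock first drains the pending work ... */
	pthread_mutex_lock(&delayed_refs_lock);
	while (pending_refs > 0) {
		pending_refs--;
		ran++;
	}
	pthread_mutex_unlock(&delayed_refs_lock);

	/* ... later committers wake up and find nothing left to do. */
	printf("this committer ran %d refs\n", ran);
	return NULL;
}

int main(void)
{
	pthread_t threads[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, committer, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	return 0;
}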


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-27 Thread Josef Bacik
On Tue, Nov 27, 2018 at 10:26:15AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:09 ч., Josef Bacik wrote:
> > The cleaner thread usually takes care of delayed iputs, with the
> > exception of the btrfs_end_transaction_throttle path.  The cleaner
> > thread only gets woken up every 30 seconds, so instead wake it up to do
> > it's work so that we can free up that space as quickly as possible.
> 
> Have you done any measurements how this affects the overall system. I
> suspect this introduces a lot of noise since now we are going to be
> doing a thread wakeup on every iput, does this give a chance to have
> nice, large batches of iputs that  the cost of wake up can be amortized
> across?

I ran the whole patchset with our A/B testing stuff and the patchset was a 5%
win overall, so I'm inclined to think it's fine.  Thanks,

Josef


Re: [PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-11-27 Thread Josef Bacik
On Mon, Nov 26, 2018 at 02:25:52PM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 21:03 ч., Josef Bacik wrote:
> > With the introduction of the per-inode block_rsv it became possible to
> > have really really large reservation requests made because of data
> > fragmentation.  Since the ticket stuff assumed that we'd always have
> > relatively small reservation requests it just killed all tickets if we
> > were unable to satisfy the current request.  However this is generally
> > not the case anymore.  So fix this logic to instead see if we had a
> > ticket that we were able to give some reservation to, and if we were
> > continue the flushing loop again.  Likewise we make the tickets use the
> > space_info_add_old_bytes() method of returning what reservation they did
> > receive in hopes that it could satisfy reservations down the line.
> 
> 
> The logic of the patch can be summarised as follows:
> 
> If no progress is made for a ticket, then start failing all tickets until
> the first one that has progress made on its reservation (inclusive). In
> this case this first ticket will be failed but at least its space will
> be reused via space_info_add_old_bytes.
> 
> Frankly this seem really arbitrary.

It's not though.  The tickets are in order of who requested the reservation.
Because we will backfill reservations for things like hugely fragmented files or
large amounts of delayed refs we can have spikes where we're trying to reserve
100mb's of metadata space.  We may fill 50mb of that before we run out of space.
Well so we can't satisfy that reservation, but the small 100k reservations that
are waiting to be serviced can be satisfied and they can run.  The alternative
is you get ENOSPC and then you can turn around and touch a file no problem
because it's a small reservation and there was room for it.  This patch enables
better behavior for the user.  Thanks,

Josef
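
A toy sketch loosely modelling the behaviour summarised in the quoted
review (the real code walks the ticket list under space_info->lock; the
names and layout here are illustrative only):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ticket {
	uint64_t bytes;		/* still missing */
	uint64_t orig_bytes;	/* originally asked for */
};

/*
 * On a failed flush: fail head-of-queue tickets up to and including the
 * first one that saw partial progress, return that partial space, and
 * report whether flushing should continue for the remaining tickets.
 */
static bool handle_failed_flush(struct ticket *t, int nr,
				uint64_t *space_returned)
{
	int i;

	*space_returned = 0;
	for (i = 0; i < nr; i++) {
		bool progress = t[i].bytes < t[i].orig_bytes;

		/* Fail this ticket, giving back whatever it was handed. */
		*space_returned += t[i].orig_bytes - t[i].bytes;
		if (progress)
			return i + 1 < nr;	/* keep flushing for the rest */
	}
	return false;	/* nobody made progress: give up on the lot */
}

int main(void)
{
	struct ticket t[3] = {
		{ .bytes = 100, .orig_bytes = 100 },	/* no progress */
		{ .bytes = 40,  .orig_bytes = 100 },	/* partially filled */
		{ .bytes = 10,  .orig_bytes = 10 },	/* untouched, kept */
	};
	uint64_t returned;
	bool keep = handle_failed_flush(t, 3, &returned);

	printf("keep flushing: %d, returned %llu bytes\n",
	       keep, (unsigned long long)returned);
	return 0;
}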


Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-27 Thread Josef Bacik
On Fri, Nov 23, 2018 at 04:59:32PM +, Filipe Manana wrote:
> On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
> >
> > I noticed in a giant dbench run that we spent a lot of time on lock
> > contention while running transaction commit.  This is because dbench
> > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> > they all run the delayed refs first thing, so they all contend with
> > each other.  This leads to seconds of 0 throughput.  Change this to only
> > run the delayed refs if we're the ones committing the transaction.  This
> > makes the latency go away and we get no more lock contention.
> 
> Can you share the following in the changelog please?
> 
> 1) How did you ran dbench (parameters, config).
> 
> 2) What results did you get before and after this change. So that we all get
> an idea of how good the impact is.
> 
> While the reduced contention makes all sense and seems valid, I'm not
> sure this is always a win.
> It certainly is when multiple tasks are calling
> btrfs_commit_transaction() simultaneously, but,
> what about when only one does it?
> 
> By running all delayed references inside the critical section of the
> transaction commit
> (state == TRANS_STATE_COMMIT_START), instead of running most of them
> outside/before,
> we will be blocking for a longer a time other tasks calling
> btrfs_start_transaction() (which is used
> a lot - creating files, unlinking files, adding links, etc, and even fsync).
> 
> Won't there by any other types of workload and tests other then dbench
> that can get increased
> latency and/or smaller throughput?
> 
> I find that sort of information useful to have in the changelog. If
> you verified that or you think
> it's irrelevant to measure/consider, it would be great to have it
> mentioned in the changelog
> (and explained).
> 

Yeah I thought about the delayed refs being run in the critical section now,
that's not awesome.  I'll drop this for now, I think just having a mutex around
running delayed refs will be good enough, since we want people who care about
flushing delayed refs to wait around for that to finish happening.  Thanks,

Josef


Re: [PATCH 5/6] btrfs: introduce delayed_refs_rsv

2018-11-27 Thread Josef Bacik
On Mon, Nov 26, 2018 at 11:14:12AM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > Traditionally we've had voodoo in btrfs to account for the space that
> > delayed refs may take up by having a global_block_rsv.  This works most
> > of the time, except when it doesn't.  We've had issues reported and seen
> > in production where sometimes the global reserve is exhausted during
> > transaction commit before we can run all of our delayed refs, resulting
> > in an aborted transaction.  Because of this voodoo we have equally
> > dubious flushing semantics around throttling delayed refs which we often
> > get wrong.
> > 
> > So instead give them their own block_rsv.  This way we can always know
> > exactly how much outstanding space we need for delayed refs.  This
> > allows us to make sure we are constantly filling that reservation up
> > with space, and allows us to put more precise pressure on the enospc
> > system.  Instead of doing math to see if its a good time to throttle,
> > the normal enospc code will be invoked if we have a lot of delayed refs
> > pending, and they will be run via the normal flushing mechanism.
> > 
> > For now the delayed_refs_rsv will hold the reservations for the delayed
> > refs, the block group updates, and deleting csums.  We could have a
> > separate rsv for the block group updates, but the csum deletion stuff is
> > still handled via the delayed_refs so that will stay there.
> 
> Couldn't the same "no premature ENOSPC in critical section" effect be
> achieved if we simply allocate 2* num bytes in start transaction without
> adding the additional granularity for delayd refs? There seems to be a
> lot of supporting code added  and this obfuscates the simple idea that
> now we do 2* reservation and put it in a separate block_rsv structure.
> 
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h |  26 ++--
> >  fs/btrfs/delayed-ref.c   |  28 -
> >  fs/btrfs/disk-io.c   |   4 +
> >  fs/btrfs/extent-tree.c   | 279 
> > +++
> >  fs/btrfs/inode.c |   4 +-
> >  fs/btrfs/transaction.c   |  77 ++--
> >  include/trace/events/btrfs.h |   2 +
> >  7 files changed, 313 insertions(+), 107 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 8b41ec42f405..0c6d589c8ce4 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -448,8 +448,9 @@ struct btrfs_space_info {
> >  #defineBTRFS_BLOCK_RSV_TRANS   3
> >  #defineBTRFS_BLOCK_RSV_CHUNK   4
> >  #defineBTRFS_BLOCK_RSV_DELOPS  5
> > -#defineBTRFS_BLOCK_RSV_EMPTY   6
> > -#defineBTRFS_BLOCK_RSV_TEMP7
> > +#define BTRFS_BLOCK_RSV_DELREFS6
> > +#defineBTRFS_BLOCK_RSV_EMPTY   7
> > +#defineBTRFS_BLOCK_RSV_TEMP8
> >  
> >  struct btrfs_block_rsv {
> > u64 size;
> > @@ -812,6 +813,8 @@ struct btrfs_fs_info {
> > struct btrfs_block_rsv chunk_block_rsv;
> > /* block reservation for delayed operations */
> > struct btrfs_block_rsv delayed_block_rsv;
> > +   /* block reservation for delayed refs */
> > +   struct btrfs_block_rsv delayed_refs_rsv;
> >  
> > struct btrfs_block_rsv empty_block_rsv;
> >  
> > @@ -2628,7 +2631,7 @@ static inline u64 
> > btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info,
> >  }
> >  
> >  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> > -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> > +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
> >  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
> >  const u64 start);
> >  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache 
> > *bg);
> > @@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
> >  enum btrfs_flush_state {
> > FLUSH_DELAYED_ITEMS_NR  =   1,
> > FLUSH_DELAYED_ITEMS =   2,
> > -   FLUSH_DELALLOC  =   3,
> > -   FLUSH_DELALLOC_WAIT =   4,
> > -   ALLOC_CHUNK =   5,
> > -   COMMIT_TRANS=   6,
> > +   FLUSH_DELAYED_REFS_NR   =   3,
> > +   FLUSH_DELAYED_REFS  =   4,
> > +   FLUSH_DELALLOC  =   5,
> > +   FLUSH_DELALLOC_WAIT =  

Re: [PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-27 Thread Josef Bacik
On Thu, Nov 22, 2018 at 12:09:34PM +0200, Nikolay Borisov wrote:
> 
> 
> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > The cleanup_extent_op function actually would run the extent_op if it
> > needed running, which made the name sort of a misnomer.  Change it to
> > run_and_cleanup_extent_op, and move the actual cleanup work to
> > cleanup_extent_op so it can be used by check_ref_cleanup() in order to
> > unify the extent op handling.
> 
> The whole name extent_op is actually a misnomer since AFAIR this is some
> sort of modification of the references of metadata nodes. I don't see
> why it can't be made as yet another type of reference which is run for a
> given node.
> 

It would change the key for a metadata extent reference for non-skinny metadata,
and it sets the FULL_BACKREF flag.  Since it really only changes flags now we
could probably roll that into its own thing, but that's out of scope for this
stuff.  Thanks,

Josef


Re: [PATCH] btrfs: Remove unnecessary code from __btrfs_rebalance

2018-11-26 Thread Josef Bacik
On Thu, Nov 22, 2018 at 10:17:33AM +0200, Nikolay Borisov wrote:
> The first step of the rebalance process is ensuring there is 1MB free on
> each device. This number seems rather small. And in fact when talking
> to the original authors their opinions were:
> 
> "man that's a little bonkers"
> "i don't think we even need that code anymore"
> "I think it was there to make sure we had room for the blank 1M at the
> beginning. I bet it goes all the way back to v0"
> "we just don't need any of that tho, i say we just delete it"
> 
> Clearly, this piece of code has lost its original intent throughout
> the years. It doesn't really bring any real practical benefits to the
> relocation process. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 
> Suggested-by: Josef Bacik 

Reviewed-by: Josef Bacik 

Thanks,

Josef


[PATCH 1/3] btrfs: run delayed iputs before committing

2018-11-21 Thread Josef Bacik
Delayed iputs mean we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be freed.  So before we decide to commit the transaction for ENOSPC
reasons, run the delayed iputs so that any potential space is freed up.
If there are any and we freed enough, we can then commit the transaction and
potentially be able to make our reservation.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 90423b6749b7..3a90dc1d6b31 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4833,6 +4833,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
if (!bytes)
return 0;
 
+   /*
+* If we have pending delayed iputs then we could free up a bunch of
+* pinned space, so make sure we run the iputs before we do our pinned
+* bytes check below.
+*/
+   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+   btrfs_run_delayed_iputs(fs_info);
+   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
return PTR_ERR(trans);
-- 
2.14.3



[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-21 Thread Josef Bacik
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path.  The cleaner
thread only gets woken up every 30 seconds, so instead wake it up to do
its work so that we can free up that space as quickly as possible.

Reviewed-by: Filipe Manana 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3da9ac463344..3c42d8887183 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3264,6 +3264,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
ASSERT(list_empty(&binode->delayed_iput));
list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
spin_unlock(&fs_info->delayed_iput_lock);
+   wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3



[PATCH] btrfs: only run delayed refs if we're committing

2018-11-21 Thread Josef Bacik
I noticed in a giant dbench run that we spent a lot of time on lock
contention while running transaction commit.  This is because dbench
results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
they all run the delayed refs first thing, so they all contend with
each other.  This leads to seconds of 0 throughput.  Change this to only
run the delayed refs if we're the ones committing the transaction.  This
makes the latency go away and we get no more lock contention.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3c1be9db897c..41cc96cc59a3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   /* make a pass through all the delayed refs we have so far
-* any runnings procs may add more while we are here
-*/
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
cur_trans = trans->transaction;
 
/*
@@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
btrfs_create_pending_block_groups(trans);
 
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
int run_it = 0;
 
@@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
spin_unlock(&fs_info->trans_lock);
}
 
+   /*
+* We are now the only one in the commit area, we can run delayed refs
+* without hitting a bunch of lock contention from a lot of people
+* trying to commit the transaction at once.
+*/
+   ret = btrfs_run_delayed_refs(trans, 0);
+   if (ret)
+   goto cleanup_transaction;
+
extwriter_counter_dec(cur_trans, trans->type);
 
ret = btrfs_start_delalloc_flush(fs_info);
-- 
2.14.3



[PATCH 0/3] Delayed iput fixes

2018-11-21 Thread Josef Bacik
Here are some delayed iput fixes.  Delayed iputs can hold reservations for a
while, and there's no good way to make sure they are gone for good, which
means we could hit an early ENOSPC when in reality, had we just waited for the
iput, we would have had plenty of space.  So fix this up by making us wait for delayed
iputs when deciding if we need to commit for enospc flushing, and then cleanup
and rework how we run delayed iputs to make it more straightforward to wait on
them and make sure we're all done using them.  Thanks,

Josef


[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-11-21 Thread Josef Bacik
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  4 +++-
 fs/btrfs/disk-io.c |  5 ++---
 fs/btrfs/extent-tree.c |  9 +
 fs/btrfs/inode.c   | 21 +
 4 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 709de7471d86..a835fe7076eb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -912,7 +912,8 @@ struct btrfs_fs_info {
 
spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
-   struct mutex cleaner_delayed_iput_mutex;
+   atomic_t nr_delayed_iputs;
+   wait_queue_head_t delayed_iputs_wait;
 
/* this protects tree_mod_seq_list */
spinlock_t tree_mod_seq_lock;
@@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
  u64 start, u64 num_bytes, u64 min_size,
  loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5918ff8241b..3f81dfaefa32 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
goto sleep;
}
 
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
again = btrfs_clean_one_deleted_snapshot(root);
mutex_unlock(&fs_info->cleaner_mutex);
@@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->delete_unused_bgs_mutex);
mutex_init(&fs_info->reloc_mutex);
mutex_init(&fs_info->delalloc_root_mutex);
-   mutex_init(&fs_info->cleaner_delayed_iput_mutex);
seqlock_init(&fs_info->profiles_lock);
 
INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
atomic_set(&fs_info->defrag_running, 0);
atomic_set(&fs_info->qgroup_op_seq, 0);
atomic_set(&fs_info->reada_works_cnt, 0);
+   atomic_set(&fs_info->nr_delayed_iputs, 0);
atomic64_set(&fs_info->tree_mod_seq, 0);
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
init_waitqueue_head(&fs_info->transaction_wait);
init_waitqueue_head(&fs_info->transaction_blocked_wait);
init_waitqueue_head(&fs_info->async_submit_wait);
+   init_waitqueue_head(&fs_info->delayed_iputs_wait);
 
INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3a90dc1d6b31..36f43876be22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 * operations. Wait for it to finish so that
 * more space is released.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   ret = btrfs_wait_on_delayed_iputs(fs_info);
+   if (ret)
+   return ret;
goto again;
} else {
btrfs_end_transaction(trans);
@@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 * pinned space, so make sure we run the iputs before we do our pinned
 * bytes check below.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   wait_event(fs_info->delayed_iputs_wait,
+  atomic_read(&fs_info->nr_delayed_iputs) == 0);
 
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
diff --git a/fs/btrfs/inode.c b/fs/b
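
To illustrate the counter-plus-waitqueue pattern the changelog describes,
here is a small pthreads model of it; the kernel side uses atomic_t,
wait_event() and wake_up() rather than a condition variable, and the
helper names below are invented for the sketch:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t no_iputs = PTHREAD_COND_INITIALIZER;
static int nr_delayed_iputs;

static void add_delayed_iput(void)
{
	pthread_mutex_lock(&lock);
	nr_delayed_iputs++;			/* atomic_inc() in the kernel */
	pthread_mutex_unlock(&lock);
}

static void run_one_delayed_iput(void)
{
	pthread_mutex_lock(&lock);
	if (--nr_delayed_iputs == 0)		/* atomic_dec_and_test() */
		pthread_cond_broadcast(&no_iputs);	/* wake_up() */
	pthread_mutex_unlock(&lock);
}

/* Mirrors the wait_event(... nr_delayed_iputs == 0) call in the patch. */
static void wait_on_delayed_iputs(void)
{
	pthread_mutex_lock(&lock);
	while (nr_delayed_iputs > 0)
		pthread_cond_wait(&no_iputs, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	add_delayed_iput();
	run_one_delayed_iput();
	wait_on_delayed_iputs();
	printf("no delayed iputs pending\n");
	return 0;
}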

[PATCH 4/7] btrfs: call btrfs_create_pending_block_groups unconditionally

2018-11-21 Thread Josef Bacik
The first thing we do is loop through the list, this

if (!list_empty())
btrfs_create_pending_block_groups();

thing is just wasted space.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 3 +--
 fs/btrfs/transaction.c | 6 ++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a0f8880ee5dd..b9b829c8825c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3004,8 +3004,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
}
 
if (run_all) {
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
spin_lock(&delayed_refs->lock);
node = rb_first_cached(&delayed_refs->href_root);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a4682a696fb6..826a15a07fce 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -845,8 +845,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
btrfs_trans_release_chunk_metadata(trans);
 
@@ -1937,8 +1936,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
cur_trans->delayed_refs.flushing = 1;
smp_wmb();
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
ret = btrfs_run_delayed_refs(trans, 0);
if (ret) {
-- 
2.14.3



[PATCH 6/7] btrfs: cleanup pending bgs on transaction abort

2018-11-21 Thread Josef Bacik
We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory.  Fix this by running the
pending bgs when we cleanup a transaction.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 826a15a07fce..3c1be9db897c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2269,6 +2269,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
btrfs_scrub_continue(fs_info);
 cleanup_transaction:
btrfs_trans_release_metadata(trans);
+   /* This cleans up the pending block groups list properly. */
+   if (!trans->aborted)
+   trans->aborted = ret;
+   btrfs_create_pending_block_groups(trans);
btrfs_trans_release_chunk_metadata(trans);
trans->block_rsv = NULL;
btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3



[PATCH 5/7] btrfs: just delete pending bgs if we are aborted

2018-11-21 Thread Josef Bacik
We still need to do all of the accounting cleanup for pending block
groups if we abort.  So set ret to trans->aborted so that, if we
aborted, the cleanup still happens and everybody is happy; a sketch of
the resulting behaviour is below.
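
Roughly, the intent is the shape sketched here (a simplified fragment, not the
complete function):

	ret = trans->aborted;
	while (!list_empty(&trans->new_bgs)) {
		block_group = list_first_entry(&trans->new_bgs,
					       struct btrfs_block_group_cache,
					       bg_list);
		if (ret)
			goto next;	/* aborted: skip the on-disk updates */

		/* ... normal path: insert the block group item, etc. ... */
next:
		/* either way, unlink the group so the pending list is drained */
		list_del_init(&block_group->bg_list);
	}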

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b9b829c8825c..90423b6749b7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10500,11 +10500,17 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans)
struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_block_group_item item;
struct btrfs_key key;
-   int ret = 0;
+   int ret;
 
if (!trans->can_flush_pending_bgs)
return;
 
+   /*
+* If we aborted the transaction with pending bg's we need to just
+* cleanup the list and carry on.
+*/
+   ret = trans->aborted;
+
while (!list_empty(&trans->new_bgs)) {
block_group = list_first_entry(&trans->new_bgs,
   struct btrfs_block_group_cache,
-- 
2.14.3



[PATCH 7/7] btrfs: wait on ordered extents on abort cleanup

2018-11-21 Thread Josef Bacik
If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening.  Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8e7926c91e35..c5918ff8241b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4171,6 +4171,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
spin_lock(&fs_info->ordered_root_lock);
}
spin_unlock(&fs_info->ordered_root_lock);
+
+   /*
+* We need this here because if we've been flipped read-only we won't
+* get sync() from the umount, so we need to make sure any ordered
+* extents that haven't had their dirty pages IO start writeout yet
+* actually get run and error out properly.
+*/
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-- 
2.14.3



[PATCH 0/7] Abort cleanup fixes

2018-11-21 Thread Josef Bacik
A new xfstest that really hammers on transaction aborts (generic/495, I think)
uncovered a lot of random issues.  Some of these were introduced with the new
delayed refs rsv patches; others were just exposed by them, such as the pending
bg stuff.  With these patches in place I stopped getting all the random
leftovers and WARN_ON()s when running that test, and things are much smoother
now.  Thanks,

Josef


[PATCH 3/7] btrfs: handle delayed ref head accounting cleanup in abort

2018-11-21 Thread Josef Bacik
We weren't doing any of the accounting cleanup when we aborted
transactions.  Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code; this fixes the issue where our
accounting was all wrong after the fs aborts.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  5 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/extent-tree.c | 13 ++---
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ccc5019172b..709de7471d86 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -35,6 +35,7 @@
 struct btrfs_trans_handle;
 struct btrfs_transaction;
 struct btrfs_pending_snapshot;
+struct btrfs_delayed_ref_root;
 extern struct kmem_cache *btrfs_trans_handle_cachep;
 extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
@@ -2643,6 +2644,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
   unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
 unsigned long count, u64 transid, int wait);
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head);
 int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d02748cf3f6..8e7926c91e35 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4223,6 +4223,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (pin_bytes)
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
btrfs_put_delayed_ref_head(head);
cond_resched();
spin_lock(&delayed_refs->lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e31980d451c2..a0f8880ee5dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2457,12 +2457,11 @@ static int run_and_cleanup_extent_op(struct 
btrfs_trans_handle *trans,
return ret ? ret : 1;
 }
 
-static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
-   struct btrfs_delayed_ref_head *head)
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head)
 {
-   struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_delayed_ref_root *delayed_refs =
-   &trans->transaction->delayed_refs;
int nr_items = 1;
 
if (head->total_ref_mod < 0) {
@@ -2540,7 +2539,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 
trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
@@ -7218,7 +7217,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
mutex_unlock(&head->mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 2/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head

2018-11-21 Thread Josef Bacik
Instead of open coding this stuff use the helper instead.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f062fb0487cd..7d02748cf3f6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4215,12 +4215,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (head->must_insert_reserved)
pin_bytes = true;
btrfs_free_delayed_extent_op(head->extent_op);
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
-   atomic_dec(&delayed_refs->num_entries);
-   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
mutex_unlock(&head->mutex);
-- 
2.14.3



[PATCH 1/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock

2018-11-21 Thread Josef Bacik
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index beaa58e742b5..f062fb0487cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4197,16 +4197,9 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
 
head = rb_entry(node, struct btrfs_delayed_ref_head,
href_node);
-   if (!mutex_trylock(&head->mutex)) {
-   refcount_inc(&head->refs);
-   spin_unlock(&delayed_refs->lock);
-
-   mutex_lock(&head->mutex);
-   mutex_unlock(&head->mutex);
-   btrfs_put_delayed_ref_head(head);
-   spin_lock(&delayed_refs->lock);
+   if (btrfs_delayed_ref_lock(delayed_refs, head))
continue;
-   }
+
spin_lock(&head->lock);
while ((n = rb_first_cached(&head->ref_tree)) != NULL) {
ref = rb_entry(n, struct btrfs_delayed_ref_node,
-- 
2.14.3



[PATCH 3/8] btrfs: don't use global rsv for chunk allocation

2018-11-21 Thread Josef Bacik
We've done this forever because of the voodoo around knowing how much
space we have.  However we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily.  Instead use the actual used amount when
determining if we need to allocate a chunk or not.
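
As a rough illustration of the new check (simplified; the real
should_alloc_chunk() also handles forced and limited allocations), the decision
now keys off actual usage only:

    /* illustrative only: allocate a new metadata chunk once real usage
     * approaches what is already allocated, without padding bytes_used
     * with the ~512MiB global reserve as before */
    static int should_alloc_chunk_sketch(u64 bytes_used, u64 total_bytes)
    {
            u64 thresh = div_factor(total_bytes, 8);        /* ~80% */

            return bytes_used + SZ_2M >= thresh;
    }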

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7a30fbc05e5e..a91b3183dcae 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4388,21 +4388,12 @@ static inline u64 calc_global_rsv_need_space(struct 
btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
  struct btrfs_space_info *sinfo, int force)
 {
-   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
u64 bytes_used = btrfs_space_info_used(sinfo, false);
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
return 1;
 
-   /*
-* We need to take into account the global rsv because for all intents
-* and purposes it's used space.  Don't worry about locking the
-* global_rsv, it doesn't change except when the transaction commits.
-*/
-   if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-   bytes_used += calc_global_rsv_need_space(global_rsv);
-
/*
 * in limited mode, we want to have some free space up to
 * about 1% of the FS size.
-- 
2.14.3



[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-11-21 Thread Josef Bacik
With my change to no longer take the global reserve into account when
allocating metadata chunks, we have a side effect on mixed block group
fs'es: we no longer allocate enough chunks for the data/metadata
requirements.  To deal with this add an ALLOC_CHUNK_FORCE step to the
flushing state machine.  This will only get used if we've already made a
full loop through the flushing machinery and tried committing the
transaction.  If we have, then we can try to force a chunk allocation
since we likely need it to make progress.  This resolves the issues I
was seeing with the mixed bg tests in xfstests with my previous patch.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/extent-tree.c   | 18 +-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c6d589c8ce4..8ccc5019172b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
FLUSH_DELALLOC  =   5,
FLUSH_DELALLOC_WAIT =   6,
ALLOC_CHUNK =   7,
-   COMMIT_TRANS=   8,
+   ALLOC_CHUNK_FORCE   =   8,
+   COMMIT_TRANS=   9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a91b3183dcae..e6bb6ce23c84 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
btrfs_end_transaction(trans);
break;
case ALLOC_CHUNK:
+   case ALLOC_CHUNK_FORCE:
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
}
ret = do_chunk_alloc(trans,
 btrfs_metadata_alloc_profile(fs_info),
-CHUNK_ALLOC_NO_FORCE);
+(state == ALLOC_CHUNK) ?
+CHUNK_ALLOC_NO_FORCE :
+CHUNK_ALLOC_FORCE);
btrfs_end_transaction(trans);
if (ret > 0 || ret == -ENOSPC)
ret = 0;
@@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
commit_cycles--;
}
 
+   /*
+* We don't want to force a chunk allocation until we've tried
+* pretty hard to reclaim space.  Think of the case where we
+* free'd up a bunch of space and so have a lot of pinned space
+* to reclaim.  We would rather use that than possibly create a
+* underutilized metadata chunk.  So if this is our first run
+* through the flushing state machine skip ALLOC_CHUNK_FORCE and
+* commit the transaction.  If nothing has changed the next go
+* around then we can force a chunk allocation.
+*/
+   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+   flush_state++;
+
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
{ FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},   
\
{ FLUSH_DELAYED_REFS,   "FLUSH_ELAYED_REFS"},   
\
{ ALLOC_CHUNK,  "ALLOC_CHUNK"}, 
\
+   { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"},   
\
{ COMMIT_TRANS, "COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3



[PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-11-21 Thread Josef Bacik
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead see if we had a
ticket that we were able to give some reservation to, and if we did,
continue the flushing loop again.  Likewise we make the tickets use the
space_info_add_old_bytes() method of returning what reservation they did
receive in hopes that it could satisfy reservations down the line.
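
A small sketch of the idea (field names as in the diff below, the helper is
hypothetical): a ticket now remembers what it originally asked for, so the
flusher can tell "partially satisfied" apart from "no progress at all" and only
gives up in the latter case.

    struct reserve_ticket {
            u64 orig_bytes;         /* what the reserver asked for */
            u64 bytes;              /* what is still outstanding */
            int error;
            struct list_head list;
            wait_queue_head_t wait;
    };

    /* hypothetical helper: did flushing hand this ticket any space? */
    static bool ticket_made_progress(const struct reserve_ticket *ticket)
    {
            return ticket->bytes != ticket->orig_bytes;
    }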

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e6bb6ce23c84..983d086fa768 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4791,6 +4791,7 @@ static void shrink_delalloc(struct btrfs_fs_info 
*fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+   u64 orig_bytes;
u64 bytes;
int error;
struct list_head list;
@@ -5012,7 +5013,7 @@ static inline int need_do_async_reclaim(struct 
btrfs_fs_info *fs_info,
!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
struct reserve_ticket *ticket;
 
@@ -5021,7 +5022,10 @@ static void wake_all_tickets(struct list_head *head)
list_del_init(&ticket->list);
ticket->error = -ENOSPC;
wake_up(&ticket->wait);
+   if (ticket->bytes != ticket->orig_bytes)
+   return true;
}
+   return false;
 }
 
 /*
@@ -5089,8 +5093,12 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
-   wake_all_tickets(&space_info->tickets);
-   space_info->flush = 0;
+   if (wake_all_tickets(&space_info->tickets)) {
+   flush_state = FLUSH_DELAYED_ITEMS_NR;
+   commit_cycles--;
+   } else {
+   space_info->flush = 0;
+   }
} else {
flush_state = FLUSH_DELAYED_ITEMS_NR;
}
@@ -5142,10 +5150,11 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
   struct btrfs_space_info *space_info,
-  struct reserve_ticket *ticket, u64 orig_bytes)
+  struct reserve_ticket *ticket)
 
 {
DEFINE_WAIT(wait);
+   u64 reclaim_bytes = 0;
int ret = 0;
 
spin_lock(&space_info->lock);
@@ -5166,14 +5175,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info 
*fs_info,
ret = ticket->error;
if (!list_empty(&ticket->list))
list_del_init(&ticket->list);
-   if (ticket->bytes && ticket->bytes < orig_bytes) {
-   u64 num_bytes = orig_bytes - ticket->bytes;
-   update_bytes_may_use(space_info, -num_bytes);
-   trace_btrfs_space_reservation(fs_info, "space_info",
- space_info->flags, num_bytes, 0);
-   }
+   if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+   reclaim_bytes = ticket->orig_bytes - ticket->bytes;
spin_unlock(&space_info->lock);
 
+   if (reclaim_bytes)
+   space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
return ret;
 }
 
@@ -5199,6 +5206,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 {
struct reserve_ticket ticket;
u64 used;
+   u64 reclaim_bytes = 0;
int ret = 0;
 
ASSERT(orig_bytes);
@@ -5234,6 +5242,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 * the list and we will do our own flushing further down.
 */
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+   ticket.orig_bytes = orig_bytes;
ticket.bytes = orig_bytes;
ticket.error = 0;
init_waitqueue_head(&ticket.wait);
@@ -5274,25 +5283,21 @@ static int __reserve_metadata_bytes(struct 
btrfs_fs_info *fs_info,
return ret;
 
if (flush == BTRFS_RESERVE_FLUSH_ALL)
-   return wait_reserve_ticket(fs_info, sp

[PATCH 7/8] btrfs: be more explicit about allowed flush states

2018-11-21 Thread Josef Bacik
For FLUSH_LIMIT flushers we really can only allocate chunks and flush
delayed inode items, everything else is problematic.  I added a bunch of
new states and it led to weirdness in the FLUSH_LIMIT case because I
forgot about how it worked.  So instead explicitly declare the states
that are ok for flushing with FLUSH_LIMIT and use that for our state
machine.  Then as we add new things that are safe we can just add them
to this list.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0e9ba77e5316..e31980d451c2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5112,12 +5112,18 @@ void btrfs_init_async_reclaim_work(struct work_struct 
*work)
INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+   FLUSH_DELAYED_ITEMS_NR,
+   FLUSH_DELAYED_ITEMS,
+   ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
struct reserve_ticket *ticket)
 {
u64 to_reclaim;
-   int flush_state = FLUSH_DELAYED_ITEMS_NR;
+   int flush_state = 0;
 
spin_lock(&space_info->lock);
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5129,7 +5135,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
spin_unlock(&space_info->lock);
 
do {
-   flush_space(fs_info, space_info, to_reclaim, flush_state);
+   flush_space(fs_info, space_info, to_reclaim,
+   priority_flush_states[flush_state]);
flush_state++;
spin_lock(&space_info->lock);
if (ticket->bytes == 0) {
@@ -5137,15 +5144,7 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
return;
}
spin_unlock(&space_info->lock);
-
-   /*
-* Priority flushers can't wait on delalloc without
-* deadlocking.
-*/
-   if (flush_state == FLUSH_DELALLOC ||
-   flush_state == FLUSH_DELALLOC_WAIT)
-   flush_state = ALLOC_CHUNK;
-   } while (flush_state < COMMIT_TRANS);
+   } while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3



[PATCH 0/8] Enospc cleanups and fixes

2018-11-21 Thread Josef Bacik
The delayed refs rsv patches exposed a bunch of issues in our enospc
infrastructure that needed to be addressed.  These aren't really one coherent
group, but they are all around flushing and reservations.
may_commit_transaction() needed to be updated a little bit, and we needed to add
a new state to force chunk allocation if things got dicey.  Also because we can
end up needing to reserve a whole bunch of extra space for outstanding delayed
refs, we needed the ability to ENOSPC only the tickets that were too big to
satisfy, instead of failing all of the tickets.  There's also a fix in here for
one of the corner cases where we didn't quite have enough space reserved for the
delayed refs we were generating during evict().  Thanks,

Josef


[PATCH 6/8] btrfs: loop in inode_rsv_refill

2018-11-21 Thread Josef Bacik
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
couldn't make it re-calculate our new reservation size and try again.
If our reservation size doesn't change between tries then we know we are
actually out of space and can error out.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 56 --
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 983d086fa768..0e9ba77e5316 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5776,6 +5776,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
return ret;
 }
 
+static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
+ u64 *metadata_bytes, u64 *qgroup_bytes)
+{
+   *metadata_bytes = 0;
+   *qgroup_bytes = 0;
+
+   spin_lock(&block_rsv->lock);
+   if (block_rsv->reserved < block_rsv->size)
+   *metadata_bytes = block_rsv->size - block_rsv->reserved;
+   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+   *qgroup_bytes = block_rsv->qgroup_rsv_size -
+   block_rsv->qgroup_rsv_reserved;
+   spin_unlock(&block_rsv->lock);
+}
+
 /**
  * btrfs_inode_rsv_refill - refill the inode block rsv.
  * @inode - the inode we are refilling.
@@ -5791,25 +5806,37 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
 {
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-   u64 num_bytes = 0;
+   u64 num_bytes = 0, last = 0;
u64 qgroup_num_bytes = 0;
int ret = -ENOSPC;
 
-   spin_lock(&block_rsv->lock);
-   if (block_rsv->reserved < block_rsv->size)
-   num_bytes = block_rsv->size - block_rsv->reserved;
-   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-   qgroup_num_bytes = block_rsv->qgroup_rsv_size -
-  block_rsv->qgroup_rsv_reserved;
-   spin_unlock(&block_rsv->lock);
-
+   __get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
if (num_bytes == 0)
return 0;
 
-   ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
-   if (ret)
-   return ret;
-   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   do {
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, 
qgroup_num_bytes, true);
+   if (ret)
+   return ret;
+   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   if (ret) {
+   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   last = num_bytes;
+   /*
+* If we are fragmented we can end up with a lot of
+* outstanding extents which will make our size be much
+* larger than our reserved amount.  If we happen to
+* try to do a reservation here that may result in us
+* trying to do a pretty hefty reservation, which we may
+* not need once delalloc flushing happens.  If this is
+* the case try and do the reserve again.
+*/
+   if (flush == BTRFS_RESERVE_FLUSH_ALL)
+   __get_refill_bytes(block_rsv, &num_bytes,
+  &qgroup_num_bytes);
+   }
+   } while (ret && last != num_bytes);
+
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, false);
trace_btrfs_space_reservation(root->fs_info, "delalloc",
@@ -5819,8 +5846,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
spin_lock(&block_rsv->lock);
block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
spin_unlock(&block_rsv->lock);
-   } else
-   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   }
return ret;
 }
 
-- 
2.14.3



[PATCH 2/8] btrfs: dump block_rsv when dumping space info

2018-11-21 Thread Josef Bacik
For enospc_debug having the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
---
 fs/btrfs/extent-tree.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0dca250dc02e..7a30fbc05e5e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8052,6 +8052,15 @@ static noinline int find_free_extent(struct 
btrfs_fs_info *fs_info,
return ret;
 }
 
+#define DUMP_BLOCK_RSV(fs_info, rsv_name)  \
+do {   \
+   struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;   \
+   spin_lock(&__rsv->lock);\
+   btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",  \
+  __rsv->size, __rsv->reserved);   \
+   spin_unlock(&__rsv->lock);  \
+} while (0)
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *info, u64 bytes,
int dump_block_groups)
@@ -8071,6 +8080,12 @@ static void dump_space_info(struct btrfs_fs_info 
*fs_info,
info->bytes_readonly);
spin_unlock(>lock);
 
+   DUMP_BLOCK_RSV(fs_info, global_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
+
if (!dump_block_groups)
return;
 
-- 
2.14.3



[PATCH 8/8] btrfs: reserve extra space during evict()

2018-11-21 Thread Josef Bacik
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact.  So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cae30f6c095f..3da9ac463344 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+   u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
int failures = 0;
 
for (;;) {
struct btrfs_trans_handle *trans;
int ret;
 
-   ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
+   ret = btrfs_block_rsv_refill(root, rsv,
+rsv->size + delayed_refs_extra,
 BTRFS_RESERVE_FLUSH_LIMIT);
 
if (ret && ++failures > 2) {
@@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
return ERR_PTR(-ENOSPC);
}
 
+   /*
+* Evict can generate a large amount of delayed refs without
+* having a way to add space back since we exhaust our temporary
+* block rsv.  We aren't allowed to do FLUSH_ALL in this case
+* because we could deadlock with so many things in the flushing
+* code, so we have to try and hold some extra space to
+* compensate for our delayed ref generation.  If we can't get
+* that space then we need see if we can steal our minimum from
+* the global reserve.  We will be ratelimited by the amount of
+* space we have for the delayed refs rsv, so we'll end up
+* committing and trying again.
+*/
trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans) || !ret)
+   if (IS_ERR(trans) || !ret) {
+   if (!IS_ERR(trans)) {
trans->block_rsv = &fs_info->trans_block_rsv;
+   trans->bytes_reserved = delayed_refs_extra;
+   btrfs_block_rsv_migrate(rsv, trans->block_rsv,
+   delayed_refs_extra, 1);
+   }
return trans;
+   }
 
/*
 * Try to steal from the global reserve if there is space for
-- 
2.14.3



[PATCH 1/8] btrfs: check if free bgs for commit

2018-11-21 Thread Josef Bacik
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8af68b13fa27..0dca250dc02e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4843,10 +4843,18 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+   trans = btrfs_join_transaction(fs_info->extent_root);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+   /*
+* See if there is enough pinned space to make this reservation, or if
+* we have bg's that are going to be freed, allowing us to possibly do a
+* chunk allocation the next loop through.
+*/
+   if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+   __percpu_counter_compare(&fs_info->total_bytes_pinned, bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
goto commit;
 
/*
@@ -4854,7 +4862,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * this reservation.
 */
if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
+   goto enospc;
 
spin_lock(&delayed_rsv->lock);
reclaim_bytes += delayed_rsv->reserved;
@@ -4868,17 +4876,14 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
bytes -= reclaim_bytes;
 
if (__percpu_counter_compare(&fs_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-   return -ENOSPC;
-   }
-
+bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+   goto enospc;
 commit:
-   trans = btrfs_join_transaction(fs_info->extent_root);
-   if (IS_ERR(trans))
-   return -ENOSPC;
-
return btrfs_commit_transaction(trans);
+enospc:
+   btrfs_end_transaction(trans);
+   return -ENOSPC;
 }
 
 /*
-- 
2.14.3



[PATCH 6/6] btrfs: fix truncate throttling

2018-11-21 Thread Josef Bacik
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.
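
The replacement boils down to something like this sketch (simplified;
btrfs_throttle_delayed_refs() is the refill helper added earlier in the series,
and the exact flush mode used here is an assumption):

    /* hypothetical helper name: decide whether the truncate loop should
     * stop and let the caller restart the transaction */
    static bool truncate_should_stop(struct btrfs_trans_handle *trans)
    {
            struct btrfs_fs_info *fs_info = trans->fs_info;

            if (!btrfs_should_throttle_delayed_refs(trans))
                    return false;

            /* try to top up the delayed refs rsv without flushing; if we
             * cannot, we are short on space and need to bail out */
            return btrfs_throttle_delayed_refs(fs_info,
                                               BTRFS_RESERVE_NO_FLUSH) != 0;
    }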

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 79 
 1 file changed, 17 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8532a2eb56d1..cae30f6c095f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   u64 bytes_deleted)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   int ret;
-
-   /*
-* This is only used to apply pressure to the enospc system, we don't
-* intend to use this reservation at all.
-*/
-   bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-   bytes_deleted *= fs_info->nodesize;
-   ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
- bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-   if (!ret) {
-   trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid,
- bytes_deleted, 1);
-   trans->bytes_reserved += bytes_deleted;
-   }
-   return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
-   bool should_end = false;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
btrfs_abort_transaction(trans, ret);
break;
}
-   if (btrfs_should_throttle_delayed_refs(trans))
-   btrfs_async_run_delayed_refs(fs_info,
-   trans->delayed_ref_updates * 2,
-   trans->transid, 0);
if (be_nice) {
-   if (truncate_space_check(trans, root,
-extent_num_bytes)) {
-   should_end = true;
-   }
if (btrfs_should_throttle_delayed_refs(trans))
should_throttle = true;
}
@@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
if (path->slots[0] == 0 ||
path->slots[0] != pending_del_slot ||
-   should_throttle || should_end) {
+   should_throttle) {
if (pending_del_nr) {
ret = btrfs_del_items(trans, root, path,
pending_del_slot,
@@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct 
btrfs_trans_handle *trans,
pending_del_nr = 0;
}
btrfs_release_path(path);
-   if (should_throttle) {
-   unsigned long updates = 
trans->delayed_ref_updates;
-   if (updates) {
-   trans->delayed_ref_updates = 0;
-   ret = btrfs_run_delayed_refs(trans,
-  updates * 2);
-   if (ret)
-   break;
-   }
-   }
+
/*
-* if we failed to refill our space rsv, bail out
-* and let the transaction restart
+* We can generate a lot of delayed refs, so we need to
+* throttle every once and a while and make sure we're
+* adding enough space to keep up with the work we are
+* generating.  Since we hold a transaction here we
+* can't flush, and we don't want to FLUSH_LIMIT because
+* we could have generated too many delayed refs to
+* actually allocate, so just bail if we're short and
+ 

[PATCH 5/6] btrfs: introduce delayed_refs_rsv

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if it's a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
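
As a back-of-the-envelope sketch of the sizing model (illustrative only, not
the exact kernel code): one delayed ref head costs roughly one tree operation
worth of metadata, so the reservation we want to keep filled is that unit times
the number of outstanding heads.

    static u64 delayed_refs_rsv_size_sketch(struct btrfs_fs_info *fs_info,
                                            u64 nr_ref_heads)
    {
            /* one "unit" is what a single tree operation may need */
            return nr_ref_heads * btrfs_calc_trans_metadata_size(fs_info, 1);
    }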

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  26 ++--
 fs/btrfs/delayed-ref.c   |  28 -
 fs/btrfs/disk-io.c   |   4 +
 fs/btrfs/extent-tree.c   | 279 +++
 fs/btrfs/inode.c |   4 +-
 fs/btrfs/transaction.c   |  77 ++--
 include/trace/events/btrfs.h |   2 +
 7 files changed, 313 insertions(+), 107 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b41ec42f405..0c6d589c8ce4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -448,8 +448,9 @@ struct btrfs_space_info {
 #defineBTRFS_BLOCK_RSV_TRANS   3
 #defineBTRFS_BLOCK_RSV_CHUNK   4
 #defineBTRFS_BLOCK_RSV_DELOPS  5
-#defineBTRFS_BLOCK_RSV_EMPTY   6
-#defineBTRFS_BLOCK_RSV_TEMP7
+#define BTRFS_BLOCK_RSV_DELREFS6
+#defineBTRFS_BLOCK_RSV_EMPTY   7
+#defineBTRFS_BLOCK_RSV_TEMP8
 
 struct btrfs_block_rsv {
u64 size;
@@ -812,6 +813,8 @@ struct btrfs_fs_info {
struct btrfs_block_rsv chunk_block_rsv;
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv;
+   /* block reservation for delayed refs */
+   struct btrfs_block_rsv delayed_refs_rsv;
 
struct btrfs_block_rsv empty_block_rsv;
 
@@ -2628,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
btrfs_fs_info *fs_info,
 }
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
 void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 const u64 start);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
@@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR  =   1,
FLUSH_DELAYED_ITEMS =   2,
-   FLUSH_DELALLOC  =   3,
-   FLUSH_DELALLOC_WAIT =   4,
-   ALLOC_CHUNK =   5,
-   COMMIT_TRANS=   6,
+   FLUSH_DELAYED_REFS_NR   =   3,
+   FLUSH_DELAYED_REFS  =   4,
+   FLUSH_DELALLOC  =   5,
+   FLUSH_DELALLOC_WAIT =   6,
+   ALLOC_CHUNK =   7,
+   COMMIT_TRANS=   8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
@@ -2796,6 +2801,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info 
*fs_info,
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 struct btrfs_block_rsv *block_rsv,
 u64 num_bytes);
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
+int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info,
+   enum btrfs_reserve_flush_enum flush);
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+  struct btrfs_block_rsv *src,
+  u64 num_bytes);
 int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 48725fa757a3..96de92588f06 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -474,11

[PATCH 0/6] Delayed refs rsv

2018-11-21 Thread Josef Bacik
This patchset changes how we do space reservations for delayed refs.  We were
hitting probably 20-40 enospc aborts per day in production while running
delayed refs at transaction commit time.  This means we ran out of space in the
global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if
we're running short on space and when it's time to force a commit.  A failure
rate of 20-40 file systems when we run hundreds of thousands of them isn't super
high, but cleaning up this code will make things less ugly and more predictable.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the global
reserve for that spill over, which fits better into its desired use than a full
blown reservation.  This first approach is to simply take the number of items
we're reserving space for and multiply that by 2 in order to save enough space
for the delayed refs that could be generated.  This is a naive approach and will
probably evolve, but for now it works.
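
To make the "multiply that by 2" concrete, an illustrative sketch (the
arithmetic is the point; the function itself is not the actual kernel code): a
transaction start that needs space for num_items tree operations reserves that
amount twice and hands the second half to the delayed refs rsv.

    static u64 trans_start_bytes_sketch(struct btrfs_fs_info *fs_info,
                                        unsigned int num_items)
    {
            u64 trans_bytes = btrfs_calc_trans_metadata_size(fs_info, num_items);
            u64 delayed_refs_bytes = trans_bytes;   /* the "times 2" part */

            return trans_bytes + delayed_refs_bytes;
    }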

With this patchset we've gone down to 2-8 failures per week.  It's not perfect,
there are some corner cases that still need to be addressed, but is
significantly better than what we had.  Thanks,

Josef


[PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter how many items are on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b3e4c9fcb664..48725fa757a3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct 
btrfs_trans_handle *trans,
ref->in_tree = 0;
btrfs_put_delayed_ref(ref);
atomic_dec(&delayed_refs->num_entries);
-   if (trans->delayed_ref_updates)
-   trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans,
if (ref->action == BTRFS_ADD_DELAYED_REF)
list_add_tail(&ref->add_list, &href->ref_add_list);
atomic_inc(&root->num_entries);
-   trans->delayed_ref_updates++;
spin_unlock(&href->lock);
return ret;
 }
-- 
2.14.3



[PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval 
Reviewed-by: Liu Bo 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 67 +-
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c36b3a42f2bb..e3ed3507018d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
*trans,
return ret ? ret : 1;
 }
 
+static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_delayed_ref_root *delayed_refs =
+   >transaction->delayed_refs;
+
+   if (head->total_ref_mod < 0) {
+   struct btrfs_space_info *space_info;
+   u64 flags;
+
+   if (head->is_data)
+   flags = BTRFS_BLOCK_GROUP_DATA;
+   else if (head->is_system)
+   flags = BTRFS_BLOCK_GROUP_SYSTEM;
+   else
+   flags = BTRFS_BLOCK_GROUP_METADATA;
+   space_info = __find_space_info(fs_info, flags);
+   ASSERT(space_info);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+  -head->num_bytes,
+  BTRFS_TOTAL_BYTES_PINNED_BATCH);
+
+   if (head->is_data) {
+   spin_lock(&delayed_refs->lock);
+   delayed_refs->pending_csums -= head->num_bytes;
+   spin_unlock(&delayed_refs->lock);
+   }
+   }
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+ head->qgroup_reserved);
+}
+
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *head)
 {
@@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 
-   trace_run_delayed_ref_head(fs_info, head, 0);
-
-   if (head->total_ref_mod < 0) {
-   struct btrfs_space_info *space_info;
-   u64 flags;
-
-   if (head->is_data)
-   flags = BTRFS_BLOCK_GROUP_DATA;
-   else if (head->is_system)
-   flags = BTRFS_BLOCK_GROUP_SYSTEM;
-   else
-   flags = BTRFS_BLOCK_GROUP_METADATA;
-   space_info = __find_space_info(fs_info, flags);
-   ASSERT(space_info);
-   percpu_counter_add_batch(&space_info->total_bytes_pinned,
-  -head->num_bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH);
-
-   if (head->is_data) {
-   spin_lock(&delayed_refs->lock);
-   delayed_refs->pending_csums -= head->num_bytes;
-   spin_unlock(&delayed_refs->lock);
-   }
-   }
-
if (head->must_insert_reserved) {
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
@@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   /* Also free its reserved qgroup space */
-   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
- head->qgroup_reserved);
+   cleanup_ref_head_accounting(trans, head);
+
+   trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
btrfs_put_delayed_ref_head(head);
return 0;
@@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
+   cleanup_ref_head_accounting(trans, head);
mutex_unlock(&head->mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer.  Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e3ed3507018d..8a776dc9cb38 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct 
btrfs_delayed_ref_root *delayed_ref
btrfs_delayed_ref_unlock(head);
 }
 
-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head)
+static struct btrfs_delayed_extent_op *
+cleanup_extent_op(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_head *head)
 {
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
-   int ret;
 
if (!extent_op)
-   return 0;
-   head->extent_op = NULL;
+   return NULL;
+
if (head->must_insert_reserved) {
+   head->extent_op = NULL;
btrfs_free_delayed_extent_op(extent_op);
-   return 0;
+   return NULL;
}
+   return extent_op;
+}
+
+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
+struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_delayed_extent_op *extent_op =
+   cleanup_extent_op(trans, head);
+   int ret;
+
+   if (!extent_op)
+   return 0;
+   head->extent_op = NULL;
spin_unlock(>lock);
ret = run_delayed_extent_op(trans, head, extent_op);
btrfs_free_delayed_extent_op(extent_op);
@@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
 
delayed_refs = >transaction->delayed_refs;
 
-   ret = cleanup_extent_op(trans, head);
+   ret = run_and_cleanup_extent_op(trans, head);
if (ret < 0) {
unselect_delayed_ref_head(delayed_refs, head);
btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root))
goto out;
 
-   if (head->extent_op) {
-   if (!head->must_insert_reserved)
-   goto out;
-   btrfs_free_delayed_extent_op(head->extent_op);
-   head->extent_op = NULL;
-   }
+   if (cleanup_extent_op(trans, head) != NULL)
+   goto out;
 
/*
 * waiting for the lock here would deadlock.  If someone else has it
-- 
2.14.3



[PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/delayed-ref.c | 14 ++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 22 +++---
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..b3e4c9fcb664 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head)
+{
+   lockdep_assert_held(&delayed_refs->lock);
+   lockdep_assert_held(&head->lock);
+
+   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
+   RB_CLEAR_NODE(&head->href_node);
+   atomic_dec(&delayed_refs->num_entries);
+   delayed_refs->num_heads--;
+   if (head->processing == 0)
+   delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..d2af974f68a1 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
btrfs_delayed_ref_head *head)
 {
mutex_unlock(&head->mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *btrfs_select_ref_head(
struct btrfs_delayed_ref_root *delayed_refs);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..c36b3a42f2bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(&delayed_refs->lock);
return 1;
}
-   delayed_refs->num_heads--;
-   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
-   atomic_dec(&delayed_refs->num_entries);
 
trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!mutex_trylock(&head->mutex))
goto out;
 
-   /*
-* at this point we have a head with no other entries.  Go
-* ahead and process it.
-*/
-   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
-   atomic_dec(&delayed_refs->num_entries);
-
-   /*
-* we don't take a ref on the node because we're removing it from the
-* tree, so we just steal the ref the tree was holding.
-*/
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
+   btrfs_delete_ref_head(delayed_refs, head);
head->processing = 0;
+
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 
-- 
2.14.3



Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-11-20 Thread Josef Bacik
On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>btrfs_balance+0xbeb/0x1740 [btrfs]
>btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Turns out during page writeback it's possible that the code in
> writepage_delalloc can instantiate a delalloc range which doesn't
> belong to the page currently being written back. This happens since
> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
> range when asked and doesn't really consider the range of the passed
> page. When such a foreign range is found the code proceeds to
> run_delalloc_range and calls the appropriate function to fill the
> delalloc and create ordered extents. If, however, a failure occurs
> while this operation is in effect then the clean up code in
> btrfs_cleanup_ordered_extents will be called. This function has the
> wrong assumption that caller of run_delalloc_range always properly
> cleans the first page of the range hence when it calls
> __endio_write_update_ordered it explicitly omits the first page of
> the delalloc range. This is wrong because this function could be
> cleaning a delalloc range that doesn't belong to the current page. This
> in turn means that the page cleanup code in __extent_writepage will
> not really free the initial page for the range, leaving a hanging
> ordered extent with bytes_left set to 4k. This bug has been present
> ever since the original introduction of the cleanup code in 524272607e88.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so correctly adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered 
> extent hang")

Can we just remove the endio cleanup in __extent_writepage() and make this do
the proper cleanup?  I'm not sure if that is feasible or not, but seems like it
would make the cleanup stuff less confusing and more straightforward.  If not
you can add

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] Btrfs: remove no longer used io_err from btrfs_log_ctx

2018-11-12 Thread Josef Bacik
On Mon, Nov 12, 2018 at 10:24:30AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The io_err field of struct btrfs_log_ctx is no longer used after the
> recent simplification of the fast fsync path, where we now wait for
> ordered extents to complete before logging the inode. We did this in
> commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync
> time") and commit a2120a473a80 ("btrfs: clean up the left over
> logged_list usage") removed its last use.
> 
> Signed-off-by: Filipe Manana 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] Btrfs: fix rare chances for data loss when doing a fast fsync

2018-11-12 Thread Josef Bacik
On Mon, Nov 12, 2018 at 10:23:58AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> After the simplification of the fast fsync patch done recently by commit
> b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and
> commit e7175a692765 ("btrfs: remove the wait ordered logic in the
> log_one_extent path"), we got a very short time window where we can get
> extents logged without writeback completing first or extents logged
> without logging the respective data checksums. Both issues can only happen
> when doing a non-full (fast) fsync.
> 
> As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
> inode and then wait for the writeback to complete before starting to log
> the inode. However before we acquire the inode's lock and after we started
> writeback, it's possible that more writes happened and dirtied more pages.
> If that happened and those pages get writeback triggered while we are
> logging the inode (for example, the VM subsystem triggering it due to
> memory pressure, or another concurrent fsync), we end up seeing the
> respective extent maps in the inode's list of modified extents and will
> log matching file extent items without waiting for the respective
> ordered extents to complete, meaning that either of the following will
> happen:
> 
> 1) We log an extent after its writeback finishes but before its checksums
>are added to the csum tree, leading to -EIO errors when attempting to
>read the extent after a log replay.
> 
> 2) We log an extent before its writeback finishes.
>Therefore after the log replay we will have a file extent item pointing
>to an unwritten extent (and without the respective data checksums as
>well).
> 
> This could not happen before the fast fsync patch simplification, because
> for any extent we found in the list of modified extents, we would wait for
> its respective ordered extent to finish writeback or collect its checksums
> for logging if it did not complete yet.
> 
> Fix this by triggering writeback again after acquiring the inode's lock
> and before waiting for ordered extents to complete.
> 
> Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the 
> log_one_extent path")
> Fixes: b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time")
> CC: sta...@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [Regression bisected] btrfs: always wait on ordered extents at fsync time

2018-11-12 Thread Josef Bacik
On Fri, Nov 09, 2018 at 10:56:42PM +, Mel Gorman wrote:
> On Fri, Nov 09, 2018 at 09:51:48PM +, Mel Gorman wrote:
> > Unfortunately, as
> > I'm about to travel, I didn't attempt a revert and a test comparing 4.18,
> > 4.19 and 4.20-rc1 is a few hours away so this could potentially be fixed
> > already but I didn't spot any obvious Fixes commit.
> > 
> 
> Still here a few hours later but the regression still appears to exist
> in mainline and the comparison report is below.  While there are slightly
> differences, the regressions are well outside multiples of the stddev and
> co-efficient of variance so I'm fairly sure it's real. The one positive
> thing is that the actual standard deviation is lower so the results are
> more stable but that is a thin silver lining.
> 

Yeah unfortunately this was expected, though in my tests I didn't see as harsh
of a regression.  I'll start working on making the regression suck less, thanks
for running this down Mel,

Josef


Re: [PATCH v15.1 04/13] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2018-11-09 Thread Josef Bacik
On Tue, Nov 06, 2018 at 02:41:13PM +0800, Lu Fengqi wrote:
> From: Wang Xiaoguang 
> 
> Introduce static function inmem_del() to remove hash from in-memory
> dedupe tree.
> And implement btrfs_dedupe_del() and btrfs_dedup_disable() interfaces.
> 
> Also for btrfs_dedupe_disable(), add new functions to wait for existing
> writers and block incoming writers to eliminate all possible races.
> 
> Cc: Mark Fasheh 
> Signed-off-by: Qu Wenruo 
> Signed-off-by: Wang Xiaoguang 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/dedupe.c | 131 +++---
>  1 file changed, 125 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 784bb3a8a5ab..951fefd19fde 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -170,12 +170,6 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
>   return ret;
>  }
>  
> -int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
> -{
> - /* Place holder for bisect, will be implemented in later patches */
> - return 0;
> -}
> -
>  static int inmem_insert_hash(struct rb_root *root,
>struct inmem_hash *hash, int hash_len)
>  {
> @@ -317,3 +311,128 @@ int btrfs_dedupe_add(struct btrfs_fs_info *fs_info,
>   return inmem_add(dedupe_info, hash);
>   return -EINVAL;
>  }
> +
> +static struct inmem_hash *
> +inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
> +{
> + struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
> + struct rb_node *parent = NULL;
> + struct inmem_hash *entry = NULL;
> +
> + while (*p) {
> + parent = *p;
> + entry = rb_entry(parent, struct inmem_hash, bytenr_node);
> +
> + if (bytenr < entry->bytenr)
> + p = &(*p)->rb_left;
> + else if (bytenr > entry->bytenr)
> + p = &(*p)->rb_right;
> + else
> + return entry;
> + }
> +
> + return NULL;
> +}
> +
> +/* Delete a hash from in-memory dedupe tree */
> +static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
> +{
> + struct inmem_hash *hash;
> +
> + mutex_lock(&dedupe_info->lock);
> + hash = inmem_search_bytenr(dedupe_info, bytenr);
> + if (!hash) {
> + mutex_unlock(&dedupe_info->lock);
> + return 0;
> + }
> +
> + __inmem_del(dedupe_info, hash);
> + mutex_unlock(&dedupe_info->lock);
> + return 0;
> +}
> +
> +/* Remove a dedupe hash from dedupe tree */
> +int btrfs_dedupe_del(struct btrfs_fs_info *fs_info, u64 bytenr)
> +{
> + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
> +
> + if (!fs_info->dedupe_enabled)
> + return 0;
> +
> + if (WARN_ON(dedupe_info == NULL))
> + return -EINVAL;
> +
> + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
> + return inmem_del(dedupe_info, bytenr);
> + return -EINVAL;
> +}
> +
> +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
> +{
> + struct inmem_hash *entry, *tmp;
> +
> + mutex_lock(&dedupe_info->lock);
> + list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
> + __inmem_del(dedupe_info, entry);
> + mutex_unlock(&dedupe_info->lock);
> +}
> +
> +/*
> + * Helper function to wait and block all incoming writers
> + *
> + * Use rw_sem introduced for freeze to wait/block writers.
> + * So during the block time, no new write will happen, so we can
> + * do something quite safe, especially helpful for dedupe disable,
> + * as it affects buffered writes.
> + */
> +static void block_all_writers(struct btrfs_fs_info *fs_info)
> +{
> + struct super_block *sb = fs_info->sb;
> +
> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
> + down_write(&sb->s_umount);
> +}
> +
> +static void unblock_all_writers(struct btrfs_fs_info *fs_info)
> +{
> + struct super_block *sb = fs_info->sb;
> +
> + up_write(&sb->s_umount);
> + percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
> +}

Please use the sb_ helpers, don't open code this.

> +
> +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_dedupe_info *dedupe_info;
> + int ret;
> +
> + dedupe_info = fs_info->dedupe_info;
> +
> + if (!dedupe_info)
> + return 0;
> +
> + /* Don't allow disable status change in RO mount */
> + if (fs_info->sb->s_flags & MS_RDONLY)
> + return -EROFS;
> +
> + /*
> +  * Wait for all unfinished writers and block further writers.
> +  * Then sync the whole fs so all current write will go through
> +  * dedupe, and all later write won't go through dedupe.
> +  */
> + block_all_writers(fs_info);
> + ret = sync_filesystem(fs_info->sb);
> + fs_info->dedupe_enabled = 0;
> + fs_info->dedupe_info = NULL;
> + unblock_all_writers(fs_info);

This is awful, don't do this.  Thanks,

Josef
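
As a rough illustration of the "use the sb_ helpers" comment above, here is a
minimal sketch that leans on the VFS freeze machinery instead of reaching into
sb->s_writers directly. Whether freeze_super()/thaw_super() is exactly what was
meant is an assumption; this is not part of the posted series.

static int dedupe_disable_sketch(struct btrfs_fs_info *fs_info)
{
        struct super_block *sb = fs_info->sb;
        int ret;

        /* Block new writers and sync the filesystem. */
        ret = freeze_super(sb);
        if (ret)
                return ret;

        /* Fields from the posted patch; in-flight writes went through dedupe. */
        fs_info->dedupe_enabled = 0;
        fs_info->dedupe_info = NULL;

        /* Let writers back in; later writes no longer go through dedupe. */
        return thaw_super(sb);
}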


Re: [PATCH v15.1 01/13] btrfs: dedupe: Introduce dedupe framework and its header

2018-11-09 Thread Josef Bacik
On Tue, Nov 06, 2018 at 02:41:10PM +0800, Lu Fengqi wrote:
> From: Wang Xiaoguang 
> 
> Introduce the header for the btrfs in-band (write time) de-duplication
> framework and the needed headers.
> 
> The new de-duplication framework is going to support 2 different dedupe
> methods and 1 dedupe hash.
> 
> Signed-off-by: Qu Wenruo 
> Signed-off-by: Wang Xiaoguang 
> Signed-off-by: Lu Fengqi 
> ---
>  fs/btrfs/ctree.h   |   7 ++
>  fs/btrfs/dedupe.h  | 128 -
>  fs/btrfs/disk-io.c |   1 +
>  include/uapi/linux/btrfs.h |  34 ++
>  4 files changed, 168 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 80953528572d..910050d904ef 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1118,6 +1118,13 @@ struct btrfs_fs_info {
>   spinlock_t ref_verify_lock;
>   struct rb_root block_tree;
>  #endif
> +
> + /*
> +  * Inband de-duplication related structures
> +  */
> + unsigned long dedupe_enabled:1;

Please use a BTRFS_FS_ flag for this.  Thanks,

Josef
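
For reference, a tiny sketch of the BTRFS_FS_ flag style requested above. The
bit name BTRFS_FS_DEDUPE_ENABLED and its value are hypothetical; fs_info->flags
is the existing unsigned long bitmap used by the other BTRFS_FS_* state bits.

#define BTRFS_FS_DEDUPE_ENABLED 28      /* hypothetical bit number */

static inline void btrfs_set_dedupe_enabled(struct btrfs_fs_info *fs_info)
{
        set_bit(BTRFS_FS_DEDUPE_ENABLED, &fs_info->flags);
}

static inline bool btrfs_dedupe_is_enabled(struct btrfs_fs_info *fs_info)
{
        return test_bit(BTRFS_FS_DEDUPE_ENABLED, &fs_info->flags);
}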


Re: [PATCH] btrfs: Check for missing device before bio submission in btrfs_map_bio

2018-11-09 Thread Josef Bacik
On Thu, Nov 08, 2018 at 04:16:38PM +0200, Nikolay Borisov wrote:
> Before btrfs_map_bio submits all stripe bios it does a number of checks
> to ensure the device for every stripe is present. However, it doesn't
> do a DEV_STATE_MISSING check, instead this is relegated to the lower
> level btrfs_schedule_bio (in the async submission case, sync submission
> doesn't check DEV_STATE_MISSING at all). Additionally
> btrfs_schedule_bio does the duplicate device->bdev check which has
> already been performed in btrfs_map_bio.
> 
> This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and
> removes the duplicate device->bdev check. Doing so ensures that no bio
> cloning/submission happens for both async/sync requests in the face of
> missing device. This makes the async io submission path slightly shorter
> in terms of instruction count. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 
> ---

Reviewed-by: Josef Bacik 

Thanks,

Josef
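
A sketch of the kind of per-stripe validity check the patch description covers,
factored into a helper for illustration. BTRFS_DEV_STATE_MISSING and
dev->dev_state are existing names of that era; the helper itself is
hypothetical and not taken from the patch.

static bool stripe_device_usable_sketch(struct btrfs_device *dev)
{
        /* Reject stripes whose device is absent, closed, or marked missing. */
        return dev && dev->bdev &&
               !test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state);
}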


Re: [PATCH] Btrfs: simpler and more efficient cleanup of a log tree's extent io tree

2018-11-09 Thread Josef Bacik
On Fri, Nov 09, 2018 at 10:43:08AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> We currently are in a loop finding each range (corresponding to a btree
> node/leaf) in a log root's extent io tree and then cleaning it up. This is a
> waste of time since we are traversing the extent io tree's rb_tree more
> times than needed (one for a range lookup and another for cleaning it up)
> without any good reason.
> 
> We free the log trees when we are in the critical section of a transaction
> commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's
> of great convenience to do everything as fast as possible in order to
> reduce the time we block other tasks from starting a new transaction.
> 
> So fix this by traversing the extent io tree once and cleaning up all its
> records in one go while traversing it.
> 
> Signed-off-by: Filipe Manana 

Sheesh

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] Btrfs: fix data corruption due to cloning of eof block

2018-11-07 Thread Josef Bacik
On Mon, Nov 05, 2018 at 11:14:17AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> We currently allow cloning a range from a file which includes the last
> block of the file even if the file's size is not aligned to the block
> size. This is fine and useful when the destination file has the same size,
> but when it does not and the range ends somewhere in the middle of the
> destination file, it leads to corruption because the bytes between the EOF
> and the end of the block have undefined data (when there is support for
> discard/trimming they have a value of 0x00).
> 
> Example:
> 
>  $ mkfs.btrfs -f /dev/sdb
>  $ mount /dev/sdb /mnt
> 
>  $ export foo_size=$((256 * 1024 + 100))
>  $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
>  $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
> 
>  $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
> 
>  $ od -A d -t x1 /mnt/bar
>  000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
>  *
>  0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
>  *
>  0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
>  0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
>  *
>  1048576
> 
> The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
> (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
> of 0xb5.
> 
> This is similar to the problem we had for deduplication that got recently
> fixed by commit de02b9f6bb65 ("Btrfs: fix data corruption when
> deduplicating between different files").
> 
> Fix this by not allowing such operations to be performed and return the
> errno -EINVAL to user space. This is what XFS is doing as well at the VFS
> level. This change however now makes us return -EINVAL instead of
> -EOPNOTSUPP for cases where the source range maps to an inline extent and
> the destination range's end is smaller than the destination file's size,
> since the detection of inline extents is done during the actual process of
> dropping file extent items (at __btrfs_drop_extents()). Returning the
> -EINVAL error is done early on and solely based on the input parameters
> (offsets and length) and destination file's size. This makes us consistent
> with XFS and anyone else supporting cloning since this case is now checked
> at a higher level in the VFS and is where the -EINVAL will be returned
> from starting with kernel 4.20 (the VFS change was introduced in 4.20-rc1
> by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into
> partial EOF block"). So this change is more geared towards stable kernels,
> as it's unlikely the new VFS checks get removed intentionally.
> 
> A test case for fstests follows soon, as well as an update to filter
> existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
> 
> CC:  # 4.4+
> Signed-off-by: Filipe Manana 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 8/8] btrfs: Remove extent_io_ops::split_extent_hook callback

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:53PM +0200, Nikolay Borisov wrote:
> This is the counterpart to merge_extent_hook, similarly, it's used only
> for data/freespace inodes so let's remove it, rename it and call it
> directly where necessary. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 
> ---

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 7/8] btrfs: Remove extent_io_ops::merge_extent_hook callback

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:52PM +0200, Nikolay Borisov wrote:
> This callback is used only for data and free space inodes. Such inodes
> are guaranteed to have their extent_io_tree::private_data set to the
> inode struct. Exploit this fact to directly call the function. Also
> give it a more descriptive name. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 
> ---

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 6/8] btrfs: Remove extent_io_ops::clear_bit_hook callback

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:51PM +0200, Nikolay Borisov wrote:
> This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
> similar to what was done before: remove clear_bit_hook and rename the
> function. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 5/8] btrfs: Remove extent_io_ops::set_bit_hook extent_io callback

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:50PM +0200, Nikolay Borisov wrote:
> This callback is used to properly account delalloc extents for
> data inodes (ordinary file inodes and freespace v1 inodes). Those can
> be easily identified since they have their extent_io trees
> ->private_data member point to the inode. Let's exploit this fact to
> remove the needless indirection through extent_io_hooks and directly
> call the function. Also give the function a name which reflects its
> purpose - btrfs_set_delalloc_extent.
> 
> This patch also modified test_find_delalloc so that the extent_io_tree
> used for testing doesn't have its ->private_data set which would have
> caused a crash in btrfs_set_delalloc_extent due to the
> btrfs_inode->root member not being initialised. The old version of the
> code also didn't call set_bit_hook since the extent_io ops weren't set
> for the inode.  No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: Josef Bacik 

Thanks,

Josef
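
The whole series follows the same pattern, sketched below with illustrative
argument names (the real hook and its replacement may take slightly different
parameters): drop the indirect call through extent_io_ops and call the renamed
function directly, since only data/free-space inodes reach this path and their
tree->private_data points at the inode.

        /* Before: indirection through the ops table. */
        if (tree->ops && tree->ops->set_bit_hook)
                tree->ops->set_bit_hook(tree->private_data, state, bits);

        /* After: direct call to the renamed function. */
        btrfs_set_delalloc_extent(tree->private_data, state, bits);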


Re: [PATCH 4/8] btrfs: Remove extent_io_ops::check_extent_io_range callback

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:49PM +0200, Nikolay Borisov wrote:
> This callback was only used in debug builds by btrfs_leak_debug_check.
> A better approach is to move its implementation into
> btrfs_leak_debug_check and ensure the latter is only executed for
> extent trees which have ->private_data set, i.e. ones that relate to a
> data inode and not the btree one. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 3/8] btrfs: Remove extent_io_ops::writepage_end_io_hook

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:48PM +0200, Nikolay Borisov wrote:
> This callback is only ever called for data page writeout so
> there is no need to actually abstract it via extent_io_ops. Lets just
> export it, remove the definition of the callback and call it directly
> in the functions that invoke the callback. Also rename the function to
> btrfs_writepage_endio_finish_ordered since what it really does is
> account finished io in the ordered extent data structures.
> No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Could send another cleanup patch to remove the struct extent_state *state from
the arg list as well.

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 2/8] btrfs: Remove extent_io_ops::writepage_start_hook

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:47PM +0200, Nikolay Borisov wrote:
> This hook is called only from __extent_writepage_io which is already
> called only from the data page writeout path. So there is no need to
> make an indirect call via extent_io_ops. This patch just removes the
> callback definition, exports the callback function and calls it
> directly at the only call site. Also give the function a more descriptive
> name. No functional changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 1/8] btrfs: Remove extent_io_ops::fill_delalloc

2018-11-01 Thread Josef Bacik
On Thu, Nov 01, 2018 at 02:09:46PM +0200, Nikolay Borisov wrote:
> This callback is called only from writepage_delalloc which in turn
> is guaranteed to be called from the data page writeout path. In the end
> there is no reason to have the call to this function be indirected
> via the extent_io_ops structure. This patch removes the callback
> definition, exports the function and calls it directly. No functional
> changes.
> 
> Signed-off-by: Nikolay Borisov 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] fstests: fix fssum to actually ignore file holes when supposed to

2018-10-31 Thread Josef Bacik
On Mon, Oct 29, 2018 at 09:43:40AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Unless the '-s' option is passed to fssum, it should not detect file holes
> and have their existence influence the computed checksum for a file. This
> tool was added to test btrfs' send/receive feature, so that it checks for
> any metadata and data differences between the original filesystem and the
> filesystem that receives send streams.
> 
> For a long time the test btrfs/007, which tests btrfs' send/receive with
> fsstress, fails sporadically reporting data differences between files.
> However the md5sum/sha1sum from the reported files in the original and
> new filesystems are the same. The reason why fssum fails is because even
> in normal mode it still accounts for the number of holes that exist in the
> file and their respective lengths. This is done using the SEEK_DATA mode
> of lseek. The btrfs send feature does not preserve holes nor prealloc
> extents (not supported by the current protocol), so whenever a hole or
> prealloc (unwritten) extent is detected in the source filesystem, it
> issues a write command full of zeroes, which will translate to a regular
> (written) extent in the destination filesystem. This is why fssum reports
> a different checksum. A prealloc extent also counts as hole when using
> lseek.
> 
> For example when passing a seed of 1540592967 to fsstress in btrfs/007,
> the test fails, as file p0/d0/f7 has a prealloc extent in the original
> filesystem (in the incr snapshot).
> 
> Fix this by making fssum just read the whole file and feed its data to the
> digest calculation function when option '-s' is not given. If we ever get
> btrfs' send/receive to support holes and fallocate, we can just change
> the test and pass the '-s' option to all fssum calls.
> 
> Signed-off-by: Filipe Manana 

Reviewed-by: Josef Bacik 

Thanks,

Josef
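
A standalone userspace illustration (not fssum itself) of why SEEK_DATA-based
scanning treats prealloc extents like holes: lseek() only reports written data
segments, so unwritten ranges are skipped exactly as real holes are. Build with
a normal C compiler and run it against any sparse or preallocated file.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        off_t size, data, hole = 0;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
                return 1;

        size = lseek(fd, 0, SEEK_END);
        while (hole < size) {
                data = lseek(fd, hole, SEEK_DATA);
                if (data < 0)
                        break;  /* no more data: trailing hole (or prealloc) */
                hole = lseek(fd, data, SEEK_HOLE);
                /* Only written extents show up here. */
                printf("data segment: %lld..%lld\n",
                       (long long)data, (long long)hole);
        }
        close(fd);
        return 0;
}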


Re: [PATCH] Btrfs: fix missing data checksums after a ranged fsync (msync)

2018-10-31 Thread Josef Bacik
On Mon, Oct 29, 2018 at 09:42:06AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Recently we got a massive simplification for fsync, where for the fast
> path we no longer log new extents while their respective ordered extents
> are still running.
> 
> However that simplification introduced a subtle regression for the case
> where we use a ranged fsync (msync). Consider the following example:
> 
>CPU 0CPU 1
> 
> mmap write to range [2Mb, 4Mb[
>   mmap write to range [512Kb, 1Mb[
>   msync range [512K, 1Mb[
> --> triggers fast fsync
> (BTRFS_INODE_NEEDS_FULL_SYNC
>  not set)
> --> creates extent map A for this
> range and adds it to list of
> modified extents
> --> starts ordered extent A for
> this range
> --> waits for it to complete
> 
> writeback triggered for range
> [2Mb, 4Mb[
>   --> create extent map B and
>   adds it to the list of
>   modified extents
>   --> creates ordered extent B
> 
> --> start looking for and logging
> modified extents
> --> logs extent maps A and B
> --> finds checksums for extent A
> in the csum tree, but not for
> extent B
>   fsync (msync) finishes
> 
>   --> ordered extent B
>   finishes and its
>   checksums are added
>   to the csum tree
> 
> 
> 
> After replaying the log, we have the extent covering the range [2Mb, 4Mb[
> but do not have the data checksum items covering that file range.
> 
> This happens because at the very beginning of an fsync (btrfs_sync_file())
> we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
> wait for any ordered extents in that range to complete before we start
> logging the extents. However if right before we start logging the extent
> in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
> such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
> or msync (btrfs_sync_file() starts writeback before acquiring the inode's
> lock), an ordered extent is created for that other range and a new extent
> map is created to represent that range and added to the inode's list of
> modified extents.
> 
> That means that we will see that other extent in that list when collecting
> extents for logging (done at btrfs_log_changed_extents()) and log the
> extent before the respective ordered extent finishes - namely before the
> checksum items are added to the checksums tree, which is where
> log_extent_csums() looks for the checksums, therefore making us log an
> extent without logging its checksums. Before that massive simplification
> of fsync, this wasn't a problem because besides looking for checksums in
> the checksums tree, we also looked for them in any ordered extent still
> running.
> 
> The consequence of data checksums missing for a file range is that users
> attempting to read the affected file range will get -EIO errors and dmesg
> reports the following:
> 
>  [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 
> 57344
>  [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 
> 57344 csum 0x98f94189 expected csum 0x mirror 1
> 
> So fix this by skipping any extents outside of our logging range at
> btrfs_log_changed_extents() and leaving them on the list of modified
> extents so that any subsequent ranged fsync may collect them if needed.
> Also, if we find a hole extent outside of the range, still log it, just
> to prevent having gaps between extent items after replaying the log,
> otherwise fsck will complain when we are not using the NO_HOLES feature
> (fstest btrfs/056 triggers such case).
> 
> Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the 
> log_one_extent path")
> CC: sta...@vger.kernel.org # 4.19+
> Signed-off-by: Filipe Manana 

Nice catch,

Reviewed-by: Josef Bacik 

Josef
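
For reference, a small userspace sketch of the scenario in the changelog: dirty
two distinct ranges through mmap and msync() only one of them, which is what
makes this a ranged fsync. It illustrates the access pattern, not a
deterministic reproducer of the race (that additionally needs writeback of the
other range to start at just the wrong moment).

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 4 * 1024 * 1024;
        int fd = open("testfile", O_RDWR | O_CREAT, 0644);
        char *p;

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        memset(p + 2 * 1024 * 1024, 0xaa, 2 * 1024 * 1024);  /* [2Mb, 4Mb[ */
        memset(p + 512 * 1024, 0xbb, 512 * 1024);            /* [512Kb, 1Mb[ */

        /* Ranged fsync: only [512Kb, 1Mb[ is synced here. */
        msync(p + 512 * 1024, 512 * 1024, MS_SYNC);

        munmap(p, len);
        close(fd);
        return 0;
}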


Re: [PATCH 0/3] fix pinned underflow in generic/475

2018-10-25 Thread Josef Bacik
On Wed, Oct 24, 2018 at 08:24:00PM +0800, Lu Fengqi wrote:
> When running generic/475, pinned underflow may occur. This patch set will
> fix this problem, but there are still other warnings that need to be
> addressed in this case.
> 
> Patch 1-2 introduce a macro and wrappers to help detect underflow
> Patch 3   fixes the pinned underflow after a transaction abort
> 
> Lu Fengqi (2):
>   btrfs: extent-tree: Detect bytes_pinned underflow earlier
>   btrfs: fix pinned underflow after transaction aborted
> 
> Qu Wenruo (1):
>   btrfs: extent-tree: Detect bytes_may_use underflow earlier
> 

You can add

Reviewed-by: Josef Bacik 

to the series, thanks,

Josef
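
A hedged sketch of the kind of underflow-detecting decrement helper the cover
letter describes; the actual macro and wrappers in the series may look
different, this only shows the idea of warning before the counter wraps.

#define dec_space_counter_sketch(field, amount)                         \
do {                                                                    \
        WARN_ONCE((amount) > (field),                                   \
                  "space counter underflow: " #field " %llu, dec %llu", \
                  (u64)(field), (u64)(amount));                         \
        (field) -= min_t(u64, (field), (amount));                       \
} while (0)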


Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-24 Thread Josef Bacik
On Wed, Oct 24, 2018 at 01:48:40PM +0100, Filipe Manana wrote:
> On Wed, Oct 24, 2018 at 1:40 PM Josef Bacik  wrote:
> >
> > On Wed, Oct 24, 2018 at 12:53:59PM +0100, Filipe Manana wrote:
> > > On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik  wrote:
> > > >
> > > > On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote:
> > > > > From: Filipe Manana 
> > > > >
> > > > > When we are writing out a free space cache, during the transaction 
> > > > > commit
> > > > > phase, we can end up in a deadlock which results in a stack trace 
> > > > > like the
> > > > > following:
> > > > >
> > > > >  schedule+0x28/0x80
> > > > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > > > >  ? finish_wait+0x80/0x80
> > > > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > > > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > > > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > > > >  ? inode_insert5+0x119/0x190
> > > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > >  btrfs_iget+0x113/0x690 [btrfs]
> > > > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > > > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > > > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > > > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > > > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > > > >  ? finish_wait+0x80/0x80
> > > > >  find_free_extent+0x799/0x1010 [btrfs]
> > > > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > > > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > > > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > > > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > > > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > > > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > > > >  ? kmem_cache_alloc+0x166/0x1d0
> > > > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > > > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > > > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > > > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > > > >
> > > > > At cache_save_setup() we need to update the inode item of a block 
> > > > > group's
> > > > > cache which is located in the tree root (fs_info->tree_root), which 
> > > > > means
> > > > > that it may result in COWing a leaf from that tree. If that happens we
> > > > > need to find a free metadata extent and while looking for one, if we 
> > > > > find
> > > > > a block group which was not cached yet we attempt to load its cache by
> > > > > calling cache_block_group(). However this function will try to load 
> > > > > the
> > > > > inode of the free space cache, which requires finding the matching 
> > > > > inode
> > > > > item in the tree root - if that inode item is located in the same 
> > > > > leaf as
> > > > > the inode item of the space cache we are updating at 
> > > > > cache_save_setup(),
> > > > > we end up in a deadlock, since we try to obtain a read lock on the 
> > > > > same
> > > > > extent buffer that we previously write locked.
> > > > >
> > > > > So fix this by using the tree root's commit root when searching for a
> > > > > block group's free space cache inode item when we are attempting to 
> > > > > load
> > > > > a free space cache. This is safe since block groups once loaded stay 
> > > > > in
> > > > > memory forever, as well as their caches, so after they are first 
> > > > > loaded
> > > > > we will never need to read their inode items again. For new block 
> > > > > groups,
> > > > > once they are created they get their ->cached field set to
> > > > > BTRFS_CACHE_FINISHED meaning we will not need to read their inode 
> > > > > item.
> > > > >
> > > > > Reported-by: Andrew Nelson 
> > > > > Link: 
> > > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > > 

Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-24 Thread Josef Bacik
On Wed, Oct 24, 2018 at 12:53:59PM +0100, Filipe Manana wrote:
> On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik  wrote:
> >
> > On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote:
> > > From: Filipe Manana 
> > >
> > > When we are writing out a free space cache, during the transaction commit
> > > phase, we can end up in a deadlock which results in a stack trace like the
> > > following:
> > >
> > >  schedule+0x28/0x80
> > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > >  ? inode_insert5+0x119/0x190
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_iget+0x113/0x690 [btrfs]
> > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  find_free_extent+0x799/0x1010 [btrfs]
> > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > >
> > > At cache_save_setup() we need to update the inode item of a block group's
> > > cache which is located in the tree root (fs_info->tree_root), which means
> > > that it may result in COWing a leaf from that tree. If that happens we
> > > need to find a free metadata extent and while looking for one, if we find
> > > a block group which was not cached yet we attempt to load its cache by
> > > calling cache_block_group(). However this function will try to load the
> > > inode of the free space cache, which requires finding the matching inode
> > > item in the tree root - if that inode item is located in the same leaf as
> > > the inode item of the space cache we are updating at cache_save_setup(),
> > > we end up in a deadlock, since we try to obtain a read lock on the same
> > > extent buffer that we previously write locked.
> > >
> > > So fix this by using the tree root's commit root when searching for a
> > > block group's free space cache inode item when we are attempting to load
> > > a free space cache. This is safe since block groups once loaded stay in
> > > memory forever, as well as their caches, so after they are first loaded
> > > we will never need to read their inode items again. For new block groups,
> > > once they are created they get their ->cached field set to
> > > BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
> > >
> > > Reported-by: Andrew Nelson 
> > > Link: 
> > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > > Tested-by: Andrew Nelson 
> > > Signed-off-by: Filipe Manana 
> > > ---
> > >
> >
> > Now my goal is to see how many times I can get you to redo this thing.
> >
> > Why not instead just do
> >
> > if (btrfs_is_free_space_inode(inode))
> >   path->search_commit_root = 1;
> >
> > in read_locked_inode?  That would be cleaner.  If we don't want to do that 
> > for
> > the inode cache (I'm not sure if it's ok or not) we could just do
> >
> > if (root == fs_info->tree_root)
> 
> We can't (not just that at least).
> Tried something like that, but we get into a BUG_ON when writing out
> the space cache for new block groups (created in the current
> transaction).
> Because at cache_save_setup() we have this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c?h=v4.19#n3342
> 
> Lookup for the inode in normal root, doesn't exist, create it then
> repeat - if still not found, BUG_ON.
> Could also make create_free_space_inode() return an inode pointer and
> make it call btrfs_iget().
> 

Ah ok makes sense.  Well in that case let's just make btrfs_read_locked_inode()
take a path, and allocate it in btrfs_iget, that'll remove the ugly

if (path != in_path)

stuff.  Thanks,

Josef


Re: [PATCH v4] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-24 Thread Josef Bacik
On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by using the tree root's commit root when searching for a
> block group's free space cache inode item when we are attempting to load
> a free space cache. This is safe since block groups once loaded stay in
> memory forever, as well as their caches, so after they are first loaded
> we will never need to read their inode items again. For new block groups,
> once they are created they get their ->cached field set to
> BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
> 
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Tested-by: Andrew Nelson 
> Signed-off-by: Filipe Manana 
> ---
> 

Now my goal is to see how many times I can get you to redo this thing.

Why not instead just do 

if (btrfs_is_free_space_inode(inode))
  path->search_commit_root = 1;

in read_locked_inode?  That would be cleaner.  If we don't want to do that for
the inode cache (I'm not sure if it's ok or not) we could just do

if (root == fs_info->tree_root)

instead.  Thanks,

Josef


Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-23 Thread Josef Bacik
On Mon, Oct 22, 2018 at 11:05:08PM +0100, Filipe Manana wrote:
> On Mon, Oct 22, 2018 at 8:18 PM Josef Bacik  wrote:
> >
> > On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote:
> > > From: Filipe Manana 
> > >
> > > When we are writing out a free space cache, during the transaction commit
> > > phase, we can end up in a deadlock which results in a stack trace like the
> > > following:
> > >
> > >  schedule+0x28/0x80
> > >  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> > >  btrfs_search_slot+0xf6/0x9f0 [btrfs]
> > >  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
> > >  ? inode_insert5+0x119/0x190
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_iget+0x113/0x690 [btrfs]
> > >  __lookup_free_space_inode+0xd8/0x150 [btrfs]
> > >  lookup_free_space_inode+0x5b/0xb0 [btrfs]
> > >  load_free_space_cache+0x7c/0x170 [btrfs]
> > >  ? cache_block_group+0x72/0x3b0 [btrfs]
> > >  cache_block_group+0x1b3/0x3b0 [btrfs]
> > >  ? finish_wait+0x80/0x80
> > >  find_free_extent+0x799/0x1010 [btrfs]
> > >  btrfs_reserve_extent+0x9b/0x180 [btrfs]
> > >  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
> > >  __btrfs_cow_block+0x11d/0x500 [btrfs]
> > >  btrfs_cow_block+0xdc/0x180 [btrfs]
> > >  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
> > >  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
> > >  ? kmem_cache_alloc+0x166/0x1d0
> > >  btrfs_update_inode_item+0x46/0x100 [btrfs]
> > >  cache_save_setup+0xe4/0x3a0 [btrfs]
> > >  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
> > >  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> > >
> > > At cache_save_setup() we need to update the inode item of a block group's
> > > cache which is located in the tree root (fs_info->tree_root), which means
> > > that it may result in COWing a leaf from that tree. If that happens we
> > > need to find a free metadata extent and while looking for one, if we find
> > > a block group which was not cached yet we attempt to load its cache by
> > > calling cache_block_group(). However this function will try to load the
> > > inode of the free space cache, which requires finding the matching inode
> > > item in the tree root - if that inode item is located in the same leaf as
> > > the inode item of the space cache we are updating at cache_save_setup(),
> > > we end up in a deadlock, since we try to obtain a read lock on the same
> > > extent buffer that we previously write locked.
> > >
> > > So fix this by skipping the loading of free space caches of any block
> > > groups that are not yet cached (rare cases) if we are COWing an extent
> > > buffer from the root tree and space caching is enabled (-o space_cache
> > > mount option). This is a rare case and its downside is failure to
> > > find a free extent (return -ENOSPC) when all the already cached block
> > > groups have no free extents.
> > >
> > > Reported-by: Andrew Nelson 
> > > Link: 
> > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> > > Tested-by: Andrew Nelson 
> > > Signed-off-by: Filipe Manana 
> >
> > Great, thanks,
> >
> > Reviewed-by: Josef Bacik 
> 
> So this makes many fstests occasionally fail with aborted transaction
> due to ENOSPC.
> It's late and I haven't verified yet, but I suppose this is because we
> always skip loading the cache regardless of whether we are currently COWing an
> existing leaf or allocating a new one (growing the tree).
> Needs to be fixed.
> 

How about we just use path->search_commit_root?  If we're loading the cache we
just want the last committed version, it's not like we read it after we've
written it.  Then we can go back to business as usual.  Thanks,

Josef
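
A sketch of the commit-root idea suggested above, for loading a free space
cache inode item from the last committed version of the tree root. Illustrative
only; the path flags and btrfs_search_slot() are the existing interfaces, while
the surrounding code and the "location" key are assumed context.

        struct btrfs_path *path = btrfs_alloc_path();

        if (!path)
                return -ENOMEM;
        /* Read the last committed root; no need to lock live extent buffers. */
        path->search_commit_root = 1;
        path->skip_locking = 1;
        /* "location" is the free space cache inode's key in the tree root. */
        ret = btrfs_search_slot(NULL, fs_info->tree_root, &location, path, 0, 0);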


Re: [PATCH v3] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 08:10:37PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are COWing an extent
> buffer from the root tree and space caching is enabled (-o space_cache
> mount option). This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Tested-by: Andrew Nelson 
> Signed-off-by: Filipe Manana 

Great, thanks,

Reviewed-by: Josef Bacik 

Josef


Re: [PATCH v2] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 07:48:30PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are COWing an extent
> buffer from the root tree and space caching is enabled (-o space_cache
> mount option). This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 
> Reported-by: Andrew Nelson 
> Link: 
> https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/
> Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists")
> Tested-by: Andrew Nelson 
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Made the solution more generic, since the problem could happen in any
> path COWing an extent buffer from the root tree.
> 
> Applies on top of a previous patch titled:
> 
>  "Btrfs: fix deadlock when writing out free space caches"
> 
>  fs/btrfs/ctree.c   |  4 
>  fs/btrfs/ctree.h   |  3 +++
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 15 ++-
>  4 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 089b46c4d97f..646aafda55a3 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1065,10 +1065,14 @@ static noinline int __btrfs_cow_block(struct 
> btrfs_trans_handle *trans,
>   root == fs_info->chunk_root ||
>   root == fs_info->dev_root)
>   trans->can_flush_pending_bgs = false;
> + else if (root == fs_info->tree_root)
> + atomic_inc(&fs_info->tree_root_cows);
>  
>   cow = btrfs_alloc_tree_block(trans, root, parent_start,
>   root->root_key.objectid, &disk_key, level,
>   search_start, empty_size);
> + if (root == fs_info->tree_root)
> + atomic_dec(&fs_info->tree_root_cows);

Do we need this though?  Our root should be the root we're cow'ing the block
for, and it should be passed all the way down to find_free_extent properly, so
we really should be able to just do if (root == fs_info->tree_root) and not add
all this stuff.

Not to mention this will race with anybody else doing stuff, so another
thread that isn't actually touching the tree_root would skip caching a block
group when it's completely ok for that thread to do it.  Thanks,

Josef


Re: [PATCH] Btrfs: fix deadlock on tree root leaf when finding free extent

2018-10-22 Thread Josef Bacik
On Mon, Oct 22, 2018 at 10:09:46AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are writing out a free space cache, during the transaction commit
> phase, we can end up in a deadlock which results in a stack trace like the
> following:
> 
>  schedule+0x28/0x80
>  btrfs_tree_read_lock+0x8e/0x120 [btrfs]
>  ? finish_wait+0x80/0x80
>  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
>  btrfs_search_slot+0xf6/0x9f0 [btrfs]
>  ? evict_refill_and_join+0xd0/0xd0 [btrfs]
>  ? inode_insert5+0x119/0x190
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_iget+0x113/0x690 [btrfs]
>  __lookup_free_space_inode+0xd8/0x150 [btrfs]
>  lookup_free_space_inode+0x5b/0xb0 [btrfs]
>  load_free_space_cache+0x7c/0x170 [btrfs]
>  ? cache_block_group+0x72/0x3b0 [btrfs]
>  cache_block_group+0x1b3/0x3b0 [btrfs]
>  ? finish_wait+0x80/0x80
>  find_free_extent+0x799/0x1010 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
>  __btrfs_cow_block+0x11d/0x500 [btrfs]
>  btrfs_cow_block+0xdc/0x180 [btrfs]
>  btrfs_search_slot+0x3bd/0x9f0 [btrfs]
>  btrfs_lookup_inode+0x3a/0xc0 [btrfs]
>  ? kmem_cache_alloc+0x166/0x1d0
>  btrfs_update_inode_item+0x46/0x100 [btrfs]
>  cache_save_setup+0xe4/0x3a0 [btrfs]
>  btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
>  btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
> 
> At cache_save_setup() we need to update the inode item of a block group's
> cache which is located in the tree root (fs_info->tree_root), which means
> that it may result in COWing a leaf from that tree. If that happens we
> need to find a free metadata extent and while looking for one, if we find
> a block group which was not cached yet we attempt to load its cache by
> calling cache_block_group(). However this function will try to load the
> inode of the free space cache, which requires finding the matching inode
> item in the tree root - if that inode item is located in the same leaf as
> the inode item of the space cache we are updating at cache_save_setup(),
> we end up in a deadlock, since we try to obtain a read lock on the same
> extent buffer that we previously write locked.
> 
> So fix this by skipping the loading of free space caches of any block
> groups that are not yet cached (rare cases) if we are updating the inode
> of a free space cache. This is a rare case and its downside is failure to
> find a free extent (return -ENOSPC) when all the already cached block
> groups have no free extents.
> 

Actually isn't this a problem for anything that tries to allocate an extent
while in the tree_root?  Like we snapshot or make a subvolume or anything?  We
should just disallow it if root == tree_root.  But even then we only need to do
this if we're using SPACE_CACHE; using the ye-olde caching or the free space
tree are both ok.  Let's just limit it to those cases.  Thanks,

Josef


Re: [PATCH] Btrfs: fix use-after-free when dumping free space

2018-10-22 Thread Josef Bacik
]  kthread+0x1d2/0x1f0
> [ 9520.409138]  ? btrfs_cleanup_transaction+0xb50/0xb50 [btrfs]
> [ 9520.409440]  ? kthread_park+0xb0/0xb0
> [ 9520.409682]  ret_from_fork+0x3a/0x50
> [ 9520.410508] Dumping ftrace buffer:
> [ 9520.410764](ftrace buffer empty)
> [ 9520.411007] CR2: 0011
> [ 9520.411297] ---[ end trace 01a0863445cf360a ]---
> [ 9520.411568] RIP: 0010:rb_next+0x3c/0x90
> [ 9520.412644] RSP: 0018:8801074ff780 EFLAGS: 00010292
> [ 9520.412932] RAX:  RBX: 0001 RCX: 
> 81b5ac4c
> [ 9520.413274] RDX:  RSI: 0008 RDI: 
> 0011
> [ 9520.413616] RBP: 8801074ff7a0 R08: ed0021d64ccc R09: 
> ed0021d64ccc
> [ 9520.414007] R10: 0001 R11: ed0021d64ccb R12: 
> 8800b91e
> [ 9520.414349] R13: 8800a3ceba48 R14: 8800b627bf80 R15: 
> 0002
> [ 9520.416074] FS:  () GS:88010eb0() 
> knlGS:
> [ 9520.416536] CS:  0010 DS:  ES:  CR0: 80050033
> [ 9520.416848] CR2: 0011 CR3: 000106b52000 CR4: 
> 06a0
> [ 9520.418477] DR0:  DR1:  DR2: 
> 
> [ 9520.418846] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 9520.419204] Kernel panic - not syncing: Fatal exception
> [ 9520.419666] Dumping ftrace buffer:
> [ 9520.419930](ftrace buffer empty)
> [ 9520.420168] Kernel Offset: disabled
> [ 9520.420406] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> Fix this by acquiring the respective lock before iterating the rbtree.
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Filipe Manana 

Reviewed-by: Josef Bacik 

Thanks,

Josef
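
A sketch of the locking the fix above describes: take the free space ctl's
spinlock around the rbtree walk while dumping entries. The field names
(tree_lock, free_space_offset, offset_index) are the existing ones in
free-space-cache.h; the function itself is illustrative, not the actual patch.

static void dump_free_space_sketch(struct btrfs_free_space_ctl *ctl)
{
        struct btrfs_free_space *info;
        struct rb_node *n;

        spin_lock(&ctl->tree_lock);
        for (n = rb_first(&ctl->free_space_offset); n; n = rb_next(n)) {
                info = rb_entry(n, struct btrfs_free_space, offset_index);
                pr_info("entry offset %llu bytes %llu\n",
                        info->offset, info->bytes);
        }
        spin_unlock(&ctl->tree_lock);
}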


[PATCH 33/42] btrfs: fix insert_reserved error handling

2018-10-12 Thread Josef Bacik
We were not handling the reserved byte accounting properly for data
references.  Metadata was mostly fine: if it errored out, the error paths
would free the bytes_reserved count and pin the extent, but even that
missed one of the error cases.  So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.

Reviewed-by: David Sterba 
Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 933cba06c9fb..9b5366932298 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
   insert_reserved);
else
BUG();
+   if (ret && insert_reserved)
+   btrfs_pin_extent(trans->fs_info, node->bytenr,
+node->num_bytes, 1);
return ret;
 }
 
@@ -8257,21 +8260,14 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
}
 
path = btrfs_alloc_path();
-   if (!path) {
-   btrfs_free_and_pin_reserved_extent(fs_info,
-  extent_key.objectid,
-  fs_info->nodesize);
+   if (!path)
return -ENOMEM;
-   }
 
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
>   &extent_key, size);
if (ret) {
btrfs_free_path(path);
-   btrfs_free_and_pin_reserved_extent(fs_info,
-  extent_key.objectid,
-  fs_info->nodesize);
return ret;
}
 
-- 
2.14.3



[PATCH 31/42] btrfs: cleanup pending bgs on transaction abort

2018-10-12 Thread Josef Bacik
We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory.  Fix this by running the
pending bgs when we clean up a transaction.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 46ca775a709e..9168efaca37e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2280,6 +2280,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_scrub_continue(fs_info);
 cleanup_transaction:
btrfs_trans_release_metadata(trans);
+   /* This cleans up the pending block groups list properly. */
+   if (!trans->aborted)
+   trans->aborted = ret;
+   btrfs_create_pending_block_groups(trans);
btrfs_trans_release_chunk_metadata(trans);
trans->block_rsv = NULL;
btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3


