On Wed, Jun 10, 2015 at 12:51:15PM +0100, Filipe Manana wrote:
> On Wed, Jun 10, 2015 at 12:27 PM, Liu Bo <[email protected]> wrote:
> > On Wed, Jun 10, 2015 at 12:12:11PM +0100, Filipe Manana wrote:
> >> On Wed, Jun 10, 2015 at 12:05 PM, Liu Bo <[email protected]> wrote:
> >> > On Wed, Jun 10, 2015 at 11:44:17AM +0100, Filipe Manana wrote:
> >> >> On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <[email protected]> wrote:
> >> >> > On Tue, May 26, 2015 at 12:55:42AM +0100, [email protected] wrote:
> >> >> >> From: Filipe Manana <[email protected]>
> >> >> >>
> >> >> >> Zygo Blaxell and other users have reported occasional hangs while an
> >> >> >> inode is being evicted, leading to traces like the following:
> >> >> >>
> >> >> >> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
> >> >> >> [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
> >> >> >> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> >> [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
> >> >> >> [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
> >> >> >> [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
> >> >> >> [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
> >> >> >> [ 5281.981396] Call Trace:
> >> >> >> [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
> >> >> >> [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
> >> >> >> [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
> >> >> >> [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
> >> >> >> [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
> >> >> >> [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
> >> >> >> [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
> >> >> >> [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
> >> >> >> [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
> >> >> >> [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
> >> >> >> [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
> >> >> >> [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >> >> >> [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
> >> >> >> [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
> >> >> >> [ 5282.001299] 1 lock held by rm/20488:
> >> >> >> [ 5282.002066]  #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
> >> >> >>
> >> >> >> This happens when we have readahead, which calls readpages(), happening
> >> >> >> right before the inode eviction handler is invoked. So the reason is
> >> >> >> essentially:
> >> >> >>
> >> >> >> 1) readpages() is called while a reference on the inode is held, so
> >> >> >>    eviction can not be triggered before readpages() returns. It also
> >> >> >>    locks one or more ranges in the inode's io_tree (which is done at
> >> >> >>    extent_io.c:__do_contiguous_readpages());
> >> >> >>
> >> >> >> 2) readpages() submits several read bios, all with an end io callback
> >> >> >>    that runs extent_io.c:end_bio_extent_readpage() and that is executed
> >> >> >>    by another task when a bio finishes, corresponding to a work queue
> >> >> >>    (fs_info->end_io_workers) worker kthread.
> >> >> >>    This callback unlocks the ranges in the inode's io_tree that
> >> >> >>    were previously locked in step 1;
> >> >> >>
> >> >> >> 3) readpages() returns, the reference on the inode is dropped;
> >> >> >>
> >> >> >> 4) One or more of the read bios previously submitted are still not
> >> >> >>    complete (their end io callback was not yet invoked or has not
> >> >> >>    yet finished execution);
> >> >> >>
> >> >> >> 5) Inode eviction is triggered (through an unlink call for example).
> >> >> >>    The inode reference count was not incremented before submitting
> >> >> >>    the read bios, therefore this is possible;
> >> >> >>
> >> >> >> 6) The eviction handler starts executing and enters the loop that
> >> >> >>    iterates over all extent states in the inode's io_tree;
> >> >> >
> >> >> > One question here:
> >> >> > btrfs_evict_inode()->
> >> >> >   evict_inode_truncate_pages()->
> >> >> >     truncate_inode_pages_final()
> >> >> >
> >> >> > truncate_inode_pages_final() finds all pages in the mapping and does
> >> >> > lock_page() on each page; if there is a readpage being run by another
> >> >> > task, it should be blocked on lock_page() until end_bio_extent_readpage()
> >> >> > does an unlock_page().
> >> >> >
> >> >> > How does the eviction handler bypass this and enter the 'extent_state'
> >> >> > loop?
> >> >> >
> >> >> > Am I missing something?
> >> >>
> >> >> Yes, you are missing something.
> >> >> end_bio_extent_readpage() unlocks the pages, and only after that does
> >> >> it unlock the ranges in the inode's io tree.
> >> >
> >> > ah, that's right. Another thing: truncate_inode_pages_final() gets
> >> > every page's lock, which means that all the endio must have been
> >> > processed, but the below comment in the code says some bios may still
> >> > not get their endio called.
> >>
> >> truncate_inode_pages_final() can lock the pages after
> >> end_bio_extent_readpage() unlocks them and before
> >> end_bio_extent_readpage() unlocks the ranges in the io tree. I know
> >> there's some mangling around that makes the bio's original endio
> >> function be run by a work queue (fs_info->endio_workers) task plus
> >> one or two more levels of indirection. But by endio I really meant
> >> end_bio_extent_readpage(), as I mention in the comment between
> >> parentheses, and not any of the indirections added by
> >> btrfs_submit_bio_hook().
> >
> > Excerpted from the below comment,
> >
> >> >> >> + * We can have ongoing bios started by readpages (called from readahead)
> >> >> >> + * that didn't get their end io callbacks called yet or they are still
> >> >> >> + * in progress ((extent_io.c:end_bio_extent_readpage()).
> >
> > I mean that by the time truncate_inode_pages_final() ends, all the
> > end_bio_extent_readpage() have already been called, since pages need
> > to be unlocked to make evict() go on.
>
> When you say "have already been called" I understood completely called,
> and not still in progress.
> Yes, by that time there's only the possibility of being in progress.
> I'll send a patch to fix the comment since Chris already picked this.
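
To make that window concrete, here is a small standalone userspace model
of the ordering being discussed (illustrative only, not btrfs code; the
two semaphores are merely stand-ins for the page lock and the io_tree
range lock, and the thread names follow the discussion above):

/*
 * Build with: gcc -pthread endio_window.c
 * The "endio" thread releases the page first and the range second, so
 * the main thread (modeling the eviction path) can get past the page
 * wait while the range is still held - exactly the window in question.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t page_lock;   /* stand-in for the page lock */
static sem_t range_lock;  /* stand-in for the io_tree range lock */

static void *endio_thread(void *arg)
{
	(void)arg;
	sem_post(&page_lock);   /* end_bio_extent_readpage(): pages first */
	usleep(1000);           /* the window: page free, range still held */
	sem_post(&range_lock);  /* ... the io_tree range only afterwards */
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* readpages() analogue: both "locks" held until the bio completes */
	sem_init(&page_lock, 0, 0);
	sem_init(&range_lock, 0, 0);
	pthread_create(&t, NULL, endio_thread, NULL);

	/* truncate_inode_pages_final() analogue: waits on pages only */
	sem_wait(&page_lock);
	printf("pages unlocked, but the range may still be locked\n");

	/* the eviction loop then blocks here until endio finishes */
	sem_wait(&range_lock);
	printf("range unlocked, eviction proceeds\n");

	pthread_join(t, NULL);
	return 0;
}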
Okay.

Reviewed-by: Liu Bo <[email protected]>

Thanks,

-liubo

> >
> > Thanks,
> >
> > -liubo
> >
> >>
> >> >
> >> > Thanks,
> >> >
> >> > -liubo
> >> >>
> >> >> thanks
> >> >>
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > -liubo
> >> >> >>
> >> >> >> 7) The loop picks one extent state record and uses its ->start and
> >> >> >>    ->end fields, after releasing the inode's io_tree spinlock, to
> >> >> >>    call lock_extent_bits() and clear_extent_bit(). The call to lock
> >> >> >>    the range [state->start, state->end] blocks because the whole
> >> >> >>    range or a part of it was locked by the previous call to
> >> >> >>    readpages() and the corresponding end io callback, which unlocks
> >> >> >>    the range, was not yet executed;
> >> >> >>
> >> >> >> 8) The end io callback for the read bio is executed and unlocks the
> >> >> >>    range [state->start, state->end] (or a superset of that range).
> >> >> >>    And at clear_extent_bit() the extent_state record state is used
> >> >> >>    as a second argument to split_state(), which sets state->start to
> >> >> >>    a larger value;
> >> >> >>
> >> >> >> 9) The task executing the eviction handler is woken up by the task
> >> >> >>    executing the bio's end io callback (through clear_state_bit) and
> >> >> >>    the eviction handler locks the range
> >> >> >>    [old value for state->start, state->end]. Shortly after, when
> >> >> >>    calling clear_extent_bit(), it unlocks the range
> >> >> >>    [new value for state->start, state->end], so it ends up unlocking
> >> >> >>    only part of the range that it locked, leaving an extent state
> >> >> >>    record in the io_tree that represents the unlocked subrange;
> >> >> >>
> >> >> >> 10) The eviction handler loop, in its next iteration, gets the
> >> >> >>     extent_state record for the subrange that it did not unlock in
> >> >> >>     the previous step and then tries to lock it, resulting in a hang.
> >> >> >>
> >> >> >> So fix this by not using the ->start and ->end fields of an existing
> >> >> >> extent_state record. This is a simple solution, and an alternative
> >> >> >> could be to bump the inode's reference count before submitting each
> >> >> >> read bio and having it dropped in the bio's end io callback. But that
> >> >> >> would be a more invasive/complex change and would not protect against
> >> >> >> other possible places that are not holding a reference on the inode
> >> >> >> as well. Something to consider in the future.
> >> >> >>
> >> >> >> Many thanks to Zygo Blaxell for reporting the issue in the mailing
> >> >> >> list, providing a set of scripts to trigger it, and testing this fix.
> >> >> >>
> >> >> >> Reported-by: Zygo Blaxell <[email protected]>
> >> >> >> Tested-by: Zygo Blaxell <[email protected]>
> >> >> >> Signed-off-by: Filipe Manana <[email protected]>
> >> >> >> ---
> >> >> >>  fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
> >> >> >>  1 file changed, 21 insertions(+), 5 deletions(-)
> >> >> >>
> >> >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> >> >> index 0020b56..ef05800 100644
> >> >> >> --- a/fs/btrfs/inode.c
> >> >> >> +++ b/fs/btrfs/inode.c
> >> >> >> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
> >> >> >>  	}
> >> >> >>  	write_unlock(&map_tree->lock);
> >> >> >>
> >> >> >> +	/*
> >> >> >> +	 * Keep looping until we have no more ranges in the io tree.
> >> >> >> +	 * We can have ongoing bios started by readpages (called from readahead)
> >> >> >> +	 * that didn't get their end io callbacks called yet or they are still
> >> >> >> +	 * in progress ((extent_io.c:end_bio_extent_readpage()). This means some
> >> >> >> +	 * ranges can still be locked and eviction started because before
> >> >> >> +	 * submitting those bios, which are executed by a separate task (work
> >> >> >> +	 * queue kthread), inode references (inode->i_count) were not taken
> >> >> >> +	 * (which would be dropped in the end io callback of each bio).
> >> >> >> +	 * Therefore here we effectively end up waiting for those bios and
> >> >> >> +	 * anyone else holding locked ranges without having bumped the inode's
> >> >> >> +	 * reference count - if we don't do it, when they access the inode's
> >> >> >> +	 * io_tree to unlock a range it may be too late, leading to an
> >> >> >> +	 * use-after-free issue.
> >> >> >> +	 */
> >> >> >>  	spin_lock(&io_tree->lock);
> >> >> >>  	while (!RB_EMPTY_ROOT(&io_tree->state)) {
> >> >> >>  		struct extent_state *state;
> >> >> >>  		struct extent_state *cached_state = NULL;
> >> >> >> +		u64 start;
> >> >> >> +		u64 end;
> >> >> >>
> >> >> >>  		node = rb_first(&io_tree->state);
> >> >> >>  		state = rb_entry(node, struct extent_state, rb_node);
> >> >> >> -		atomic_inc(&state->refs);
> >> >> >> +		start = state->start;
> >> >> >> +		end = state->end;
> >> >> >>  		spin_unlock(&io_tree->lock);
> >> >> >>
> >> >> >> -		lock_extent_bits(io_tree, state->start, state->end,
> >> >> >> -				 0, &cached_state);
> >> >> >> -		clear_extent_bit(io_tree, state->start, state->end,
> >> >> >> +		lock_extent_bits(io_tree, start, end, 0, &cached_state);
> >> >> >> +		clear_extent_bit(io_tree, start, end,
> >> >> >>  				 EXTENT_LOCKED | EXTENT_DIRTY |
> >> >> >>  				 EXTENT_DELALLOC |
> >> >> >>  				 EXTENT_DO_ACCOUNTING |
> >> >> >>  				 EXTENT_DEFRAG, 1, 1,
> >> >> >>  				 &cached_state, GFP_NOFS);
> >> >> >> -		free_extent_state(state);
> >> >> >>
> >> >> >>  		cond_resched();
> >> >> >>  		spin_lock(&io_tree->lock);
> >> >> >> --
> >> >> >> 2.1.3
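
The principle behind the fix, modeled as a small standalone userspace
program (illustrative only; the mutex and plain struct here are
stand-ins, not the kernel's io_tree types): snapshot the fields of a
shared record while still holding the lock that protects it, instead of
dereferencing the record again after the lock is dropped, when a
concurrent task may have changed it (the way split_state() moves
state->start forward in step 8 above).

/* Build with: gcc -pthread snapshot_fix.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct state_rec {
	uint64_t start;
	uint64_t end;
};

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static struct state_rec shared = { .start = 0, .end = 4095 };

/* models split_state(): moves ->start forward, under the lock */
static void *splitter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&tree_lock);
	shared.start = 2048;
	pthread_mutex_unlock(&tree_lock);
	return NULL;
}

int main(void)
{
	uint64_t start, end;
	pthread_t t;

	/* fixed pattern: copy the range before dropping the lock ... */
	pthread_mutex_lock(&tree_lock);
	start = shared.start;
	end = shared.end;
	pthread_mutex_unlock(&tree_lock);

	pthread_create(&t, NULL, splitter, NULL);
	pthread_join(t, NULL);

	/*
	 * ... so later lock/unlock calls operate on one consistent range.
	 * The buggy pattern would read shared.start again at this point and
	 * end up unlocking only [2048, 4095], leaving [0, 2047] locked
	 * forever - the hang the patch fixes.
	 */
	printf("snapshot: [%llu, %llu]\n",
	       (unsigned long long)start, (unsigned long long)end);
	printf("record:   [%llu, %llu]\n",
	       (unsigned long long)shared.start,
	       (unsigned long long)shared.end);
	return 0;
}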
