On Wed, Jun 10, 2015 at 12:51:15PM +0100, Filipe Manana wrote:
> On Wed, Jun 10, 2015 at 12:27 PM, Liu Bo <[email protected]> wrote:
> > On Wed, Jun 10, 2015 at 12:12:11PM +0100, Filipe Manana wrote:
> >> On Wed, Jun 10, 2015 at 12:05 PM, Liu Bo <[email protected]> wrote:
> >> > On Wed, Jun 10, 2015 at 11:44:17AM +0100, Filipe Manana wrote:
> >> >> On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <[email protected]> wrote:
> >> >> > On Tue, May 26, 2015 at 12:55:42AM +0100, [email protected] wrote:
> >> >> >> From: Filipe Manana <[email protected]>
> >> >> >>
> >> >> >> Zygo Blaxell and other users have reported occasional hangs while an
> >> >> >> inode is being evicted, leading to traces like the following:
> >> >> >>
> >> >> >> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
> >> >> >> [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
> >> >> >> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> >> [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
> >> >> >> [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
> >> >> >> [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
> >> >> >> [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
> >> >> >> [ 5281.981396] Call Trace:
> >> >> >> [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
> >> >> >> [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
> >> >> >> [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
> >> >> >> [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
> >> >> >> [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
> >> >> >> [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
> >> >> >> [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
> >> >> >> [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
> >> >> >> [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
> >> >> >> [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
> >> >> >> [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
> >> >> >> [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >> >> >> [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
> >> >> >> [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
> >> >> >> [ 5282.001299] 1 lock held by rm/20488:
> >> >> >> [ 5282.002066]  #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
> >> >> >>
> >> >> >> This happens when we have readahead, which calls readpages(), happening
> >> >> >> right before the inode eviction handler is invoked. So the reason is
> >> >> >> essentially:
> >> >> >>
> >> >> >> 1) readpages() is called while a reference on the inode is held, so
> >> >> >>    eviction can not be triggered before readpages() returns. It also
> >> >> >>    locks one or more ranges in the inode's io_tree (which is done at
> >> >> >>    extent_io.c:__do_contiguous_readpages());
> >> >> >>
> >> >> >> 2) readpages() submits several read bios, all with an end io callback
> >> >> >>    that runs extent_io.c:end_bio_extent_readpage() and that is executed
> >> >> >>    by another task when a bio finishes, corresponding to a work queue
> >> >> >>    (fs_info->end_io_workers) worker kthread.
> >> >> >>    This callback unlocks the ranges in the inode's io_tree that
> >> >> >>    were previously locked in step 1;
> >> >> >>
> >> >> >> 3) readpages() returns, the reference on the inode is dropped;
> >> >> >>
> >> >> >> 4) One or more of the read bios previously submitted are still not
> >> >> >>    complete (their end io callback was not yet invoked or has not
> >> >> >>    yet finished execution);
> >> >> >>
> >> >> >> 5) Inode eviction is triggered (through an unlink call for example).
> >> >> >>    The inode reference count was not incremented before submitting
> >> >> >>    the read bios, therefore this is possible;
> >> >> >>
> >> >> >> 6) The eviction handler starts executing and enters the loop that
> >> >> >>    iterates over all extent states in the inode's io_tree;
> >> >> >
> >> >> > One question here:
> >> >> > btrfs_evict_inode()->
> >> >> >   evict_inode_truncate_pages()->
> >> >> >     truncate_inode_pages_final()
> >> >> >
> >> >> > truncate_inode_pages_final() finds all pages in the mapping and does
> >> >> > lock_page() on each page; if there is a readpage being run by another
> >> >> > task, it should be blocked on lock_page() until end_bio_extent_readpage()
> >> >> > does an unlock_page().
> >> >> >
> >> >> > How does the eviction handler bypass this and enter the 'extent_state'
> >> >> > loop?
> >> >> >
> >> >> > Am I missing something?
> >> >>
> >> >> Yes, you are missing something.
> >> >> end_bio_extent_readpage() unlocks the pages, and only after that does
> >> >> it unlock the ranges in the inode's io tree.
> >> >
> >> > ah, that's right. Another thing: truncate_inode_pages_final() gets
> >> > every page's lock, which means that all the endio must have been
> >> > processed, but the below comment in the code says some bios may still
> >> > not get their endio called.
> >>
> >> truncate_inode_pages_final() can lock the pages after
> >> end_bio_extent_readpage() unlocks them and before
> >> end_bio_extent_readpage() unlocks the ranges in the io tree. I know
> >> there's some mangling around that makes the bio's original endio
> >> function be run by a work queue (fs_info->endio_workers) task plus
> >> one or two more levels of indirection. But by endio I really meant
> >> end_bio_extent_readpage(), as I mention in the comment between
> >> parentheses, and not any of the indirections added by
> >> btrfs_submit_bio_hook().
> >
> > Excerpted from the below comment,
> >
> >> >> >> + * We can have ongoing bios started by readpages (called from readahead)
> >> >> >> + * that didn't get their end io callbacks called yet or they are still
> >> >> >> + * in progress ((extent_io.c:end_bio_extent_readpage()).
> >
> > I mean that by the time truncate_inode_pages_final() ends, all the
> > end_bio_extent_readpage() have already been called, since pages need
> > to be unlocked to make evict() go on.
>
> When you say "have already been called" I understood completely called,
> and not still in progress.
> Yes, by that time there's only the possibility of being in progress.
> I'll send a patch to fix the comment since Chris already picked this.
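
To make that window concrete, here is a small standalone userspace model
of the ordering being discussed (illustrative only, not btrfs code; the
two semaphores are merely stand-ins for the page lock and the io_tree
range lock, and the thread names follow the discussion above):

/*
 * Build with: gcc -pthread endio_window.c
 * The "endio" thread releases the page first and the range second, so
 * the main thread (modeling the eviction path) can get past the page
 * wait while the range is still held - exactly the window in question.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static sem_t page_lock;   /* stand-in for the page lock */
static sem_t range_lock;  /* stand-in for the io_tree range lock */

static void *endio_thread(void *arg)
{
	(void)arg;
	sem_post(&page_lock);   /* end_bio_extent_readpage(): pages first */
	usleep(1000);           /* the window: page free, range still held */
	sem_post(&range_lock);  /* ... the io_tree range only afterwards */
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* readpages() analogue: both "locks" held until the bio completes */
	sem_init(&page_lock, 0, 0);
	sem_init(&range_lock, 0, 0);
	pthread_create(&t, NULL, endio_thread, NULL);

	/* truncate_inode_pages_final() analogue: waits on pages only */
	sem_wait(&page_lock);
	printf("pages unlocked, but the range may still be locked\n");

	/* the eviction loop then blocks here until endio finishes */
	sem_wait(&range_lock);
	printf("range unlocked, eviction proceeds\n");

	pthread_join(t, NULL);
	return 0;
}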
Okay.

Reviewed-by: Liu Bo <[email protected]>

Thanks,

-liubo

> >
> > Thanks,
> >
> > -liubo
> >
> >>
> >> >
> >> > Thanks,
> >> >
> >> > -liubo
> >> >>
> >> >> thanks
> >> >>
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > -liubo
> >> >> >>
> >> >> >> 7) The loop picks one extent state record and uses its ->start and
> >> >> >>    ->end fields, after releasing the inode's io_tree spinlock, to
> >> >> >>    call lock_extent_bits() and clear_extent_bit(). The call to lock
> >> >> >>    the range [state->start, state->end] blocks because the whole
> >> >> >>    range or a part of it was locked by the previous call to
> >> >> >>    readpages() and the corresponding end io callback, which unlocks
> >> >> >>    the range, was not yet executed;
> >> >> >>
> >> >> >> 8) The end io callback for the read bio is executed and unlocks the
> >> >> >>    range [state->start, state->end] (or a superset of that range).
> >> >> >>    And at clear_extent_bit() the extent_state record state is used
> >> >> >>    as a second argument to split_state(), which sets state->start to
> >> >> >>    a larger value;
> >> >> >>
> >> >> >> 9) The task executing the eviction handler is woken up by the task
> >> >> >>    executing the bio's end io callback (through clear_state_bit) and
> >> >> >>    the eviction handler locks the range
> >> >> >>    [old value for state->start, state->end]. Shortly after, when
> >> >> >>    calling clear_extent_bit(), it unlocks the range
> >> >> >>    [new value for state->start, state->end], so it ends up unlocking
> >> >> >>    only part of the range that it locked, leaving an extent state
> >> >> >>    record in the io_tree that represents the unlocked subrange;
> >> >> >>
> >> >> >> 10) The eviction handler loop, in its next iteration, gets the
> >> >> >>     extent_state record for the subrange that it did not unlock in
> >> >> >>     the previous step and then tries to lock it, resulting in a hang.
> >> >> >>
> >> >> >> So fix this by not using the ->start and ->end fields of an existing
> >> >> >> extent_state record. This is a simple solution, and an alternative
> >> >> >> could be to bump the inode's reference count before submitting each
> >> >> >> read bio and having it dropped in the bio's end io callback. But that
> >> >> >> would be a more invasive/complex change and would not protect against
> >> >> >> other possible places that are not holding a reference on the inode
> >> >> >> as well. Something to consider in the future.
> >> >> >>
> >> >> >> Many thanks to Zygo Blaxell for reporting the issue in the mailing
> >> >> >> list, providing a set of scripts to trigger it, and testing this fix.
> >> >> >>
> >> >> >> Reported-by: Zygo Blaxell <[email protected]>
> >> >> >> Tested-by: Zygo Blaxell <[email protected]>
> >> >> >> Signed-off-by: Filipe Manana <[email protected]>
> >> >> >> ---
> >> >> >>  fs/btrfs/inode.c | 26 +++++++++++++++++++++-----
> >> >> >>  1 file changed, 21 insertions(+), 5 deletions(-)
> >> >> >>
> >> >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> >> >> index 0020b56..ef05800 100644
> >> >> >> --- a/fs/btrfs/inode.c
> >> >> >> +++ b/fs/btrfs/inode.c
> >> >> >> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode *inode)
> >> >> >>  	}
> >> >> >>  	write_unlock(&map_tree->lock);
> >> >> >>
> >> >> >> +	/*
> >> >> >> +	 * Keep looping until we have no more ranges in the io tree.
> >> >> >> +	 * We can have ongoing bios started by readpages (called from readahead)
> >> >> >> +	 * that didn't get their end io callbacks called yet or they are still
> >> >> >> +	 * in progress ((extent_io.c:end_bio_extent_readpage()). This means some
> >> >> >> +	 * ranges can still be locked and eviction started because before
> >> >> >> +	 * submitting those bios, which are executed by a separate task (work
> >> >> >> +	 * queue kthread), inode references (inode->i_count) were not taken
> >> >> >> +	 * (which would be dropped in the end io callback of each bio).
> >> >> >> +	 * Therefore here we effectively end up waiting for those bios and
> >> >> >> +	 * anyone else holding locked ranges without having bumped the inode's
> >> >> >> +	 * reference count - if we don't do it, when they access the inode's
> >> >> >> +	 * io_tree to unlock a range it may be too late, leading to an
> >> >> >> +	 * use-after-free issue.
> >> >> >> +	 */
> >> >> >>  	spin_lock(&io_tree->lock);
> >> >> >>  	while (!RB_EMPTY_ROOT(&io_tree->state)) {
> >> >> >>  		struct extent_state *state;
> >> >> >>  		struct extent_state *cached_state = NULL;
> >> >> >> +		u64 start;
> >> >> >> +		u64 end;
> >> >> >>
> >> >> >>  		node = rb_first(&io_tree->state);
> >> >> >>  		state = rb_entry(node, struct extent_state, rb_node);
> >> >> >> -		atomic_inc(&state->refs);
> >> >> >> +		start = state->start;
> >> >> >> +		end = state->end;
> >> >> >>  		spin_unlock(&io_tree->lock);
> >> >> >>
> >> >> >> -		lock_extent_bits(io_tree, state->start, state->end,
> >> >> >> -				 0, &cached_state);
> >> >> >> -		clear_extent_bit(io_tree, state->start, state->end,
> >> >> >> +		lock_extent_bits(io_tree, start, end, 0, &cached_state);
> >> >> >> +		clear_extent_bit(io_tree, start, end,
> >> >> >>  				 EXTENT_LOCKED | EXTENT_DIRTY |
> >> >> >>  				 EXTENT_DELALLOC |
> >> >> >>  				 EXTENT_DO_ACCOUNTING |
> >> >> >>  				 EXTENT_DEFRAG, 1, 1,
> >> >> >>  				 &cached_state, GFP_NOFS);
> >> >> >> -		free_extent_state(state);
> >> >> >>
> >> >> >>  		cond_resched();
> >> >> >>  		spin_lock(&io_tree->lock);
> >> >> >> --
> >> >> >> 2.1.3
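
The principle behind the fix, modeled as a small standalone userspace
program (illustrative only; the mutex and plain struct here are
stand-ins, not the kernel's io_tree types): snapshot the fields of a
shared record while still holding the lock that protects it, instead of
dereferencing the record again after the lock is dropped, when a
concurrent task may have changed it (the way split_state() moves
state->start forward in step 8 above).

/* Build with: gcc -pthread snapshot_fix.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct state_rec {
	uint64_t start;
	uint64_t end;
};

static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
static struct state_rec shared = { .start = 0, .end = 4095 };

/* models split_state(): moves ->start forward, under the lock */
static void *splitter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&tree_lock);
	shared.start = 2048;
	pthread_mutex_unlock(&tree_lock);
	return NULL;
}

int main(void)
{
	uint64_t start, end;
	pthread_t t;

	/* fixed pattern: copy the range before dropping the lock ... */
	pthread_mutex_lock(&tree_lock);
	start = shared.start;
	end = shared.end;
	pthread_mutex_unlock(&tree_lock);

	pthread_create(&t, NULL, splitter, NULL);
	pthread_join(t, NULL);

	/*
	 * ... so later lock/unlock calls operate on one consistent range.
	 * The buggy pattern would read shared.start again at this point and
	 * end up unlocking only [2048, 4095], leaving [0, 2047] locked
	 * forever - the hang the patch fixes.
	 */
	printf("snapshot: [%llu, %llu]\n",
	       (unsigned long long)start, (unsigned long long)end);
	printf("record:   [%llu, %llu]\n",
	       (unsigned long long)shared.start,
	       (unsigned long long)shared.end);
	return 0;
}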
