On Wed, Jun 10, 2015 at 11:41 AM, Liu Bo <[email protected]> wrote: > On Tue, May 26, 2015 at 12:55:42AM +0100, [email protected] wrote: >> From: Filipe Manana <[email protected]> >> >> Zygo Blaxell and other users have reported occasional hangs while an >> inode is being evicted, leading to traces like the following: >> >> [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds. >> [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2 >> [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables >> this message. >> [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 >> 0x00000000 >> [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 >> 0000000000000001 >> [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 >> ffff8801540a5f78 >> [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e >> 0000000000000123 >> [ 5281.981396] Call Trace: >> [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83 >> [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs] >> [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 >> [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde >> [btrfs] >> [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs] >> [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs] >> [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148 >> [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5 >> [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa >> [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 >> [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab >> [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f >> [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b >> [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 >> [ 5282.001299] 1 lock held by rm/20488: >> [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] >> mnt_want_write+0x24/0x4b >> >> This happens when we have readahead, which calls readpages(), happening >> right before the inode eviction handler is invoked. So the reason is >> essentially: >> >> 1) readpages() is called while a reference on the inode is held, so >> eviction can not be triggered before readpages() returns. It also >> locks one or more ranges in the inode's io_tree (which is done at >> extent_io.c:__do_contiguous_readpages()); >> >> 2) readpages() submits several read bios, all with an end io callback >> that runs extent_io.c:end_bio_extent_readpage() and that is executed >> by other task when a bio finishes, corresponding to a work queue >> (fs_info->end_io_workers) worker kthread. This callback unlocks >> the ranges in the inode's io_tree that were previously locked in >> step 1; >> >> 3) readpages() returns, the reference on the inode is dropped; >> >> 4) One or more of the read bios previously submitted are still not >> complete (their end io callback was not yet invoked or has not >> yet finished execution); >> >> 5) Inode eviction is triggered (through an unlink call for example). >> The inode reference count was not incremented before submitting >> the read bios, therefore this is possible; >> >> 6) The eviction handler starts executing and enters the loop that >> iterates over all extent states in the inode's io_tree; > > One question here: > btrfs_evict_inode()-> > evict_inode_truncate_pages()-> > truncate_inode_pages_final() > > truncate_inode_pages_final() finds all pages in the mapping and > lock_page() on each page, if there is a readpage running by other tasks, > it should be blocked on lock_page() until end_bio_extent_readpage() > does a unlock_page(). > > How does eviction handler bypasses this and enters 'extent_state' loop? > > Am I missing something?
Yes, you are missing something. end_bio_extent_readpage() unlocks the pages and only after it does unlock the ranges in the inode's io tree. thanks > > Thanks, > > -liubo >> >> 7) The loop picks one extent state record and uses its ->start and >> ->end fields, after releasing the inode's io_tree spinlock, to >> call lock_extent_bits() and clear_extent_bit(). The call to lock >> the range [state->start, state->end] blocks because the whole >> range or a part of it was locked by the previous call to >> readpages() and the corresponding end io callback, which unlocks >> the range was not yet executed; >> >> 8) The end io callback for the read bio is executed and unlocks the >> range [state->start, state->end] (or a superset of that range). >> And at clear_extent_bit() the extent_state record state is used >> as a second argument to split_state(), which sets state->start to >> a larger value; >> >> 9) The task executing the eviction handler is woken up by the task >> executing the bio's end io callback (through clear_state_bit) and >> the eviction handler locks the range >> [old value for state->start, state->end]. Shortly after, when >> calling clear_extent_bit(), it unlocks the range >> [new value for state->start, state->end], so it ends up unlocking >> only part of the range that it locked, leaving an extent state >> record in the io_tree that represents the unlocked subrange; >> >> 10) The eviction handler loop, in its next iteration, gets the >> extent_state record for the subrange that it did not unlock in the >> previous step and then tries to lock it, resulting in an hang. >> >> So fix this by not using the ->start and ->end fields of an existing >> extent_state record. This is a simple solution, and an alternative >> could be to bump the inode's reference count before submitting each >> read bio and having it dropped in the bio's end io callback. But that >> would be a more invasive/complex change and would not protect against >> other possible places that are not holding a reference on the inode >> as well. Something to consider in the future. >> >> Many thanks to Zygo Blaxell for reporting, in the mailing list, the >> issue, a set of scripts to trigger it and testing this fix. >> >> Reported-by: Zygo Blaxell <[email protected]> >> Tested-by: Zygo Blaxell <[email protected]> >> Signed-off-by: Filipe Manana <[email protected]> >> --- >> fs/btrfs/inode.c | 26 +++++++++++++++++++++----- >> 1 file changed, 21 insertions(+), 5 deletions(-) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 0020b56..ef05800 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -4987,24 +4987,40 @@ static void evict_inode_truncate_pages(struct inode >> *inode) >> } >> write_unlock(&map_tree->lock); >> >> + /* >> + * Keep looping until we have no more ranges in the io tree. >> + * We can have ongoing bios started by readpages (called from >> readahead) >> + * that didn't get their end io callbacks called yet or they are still >> + * in progress ((extent_io.c:end_bio_extent_readpage()). This means >> some >> + * ranges can still be locked and eviction started because before >> + * submitting those bios, which are executed by a separate task (work >> + * queue kthread), inode references (inode->i_count) were not taken >> + * (which would be dropped in the end io callback of each bio). >> + * Therefore here we effectively end up waiting for those bios and >> + * anyone else holding locked ranges without having bumped the inode's >> + * reference count - if we don't do it, when they access the inode's >> + * io_tree to unlock a range it may be too late, leading to an >> + * use-after-free issue. >> + */ >> spin_lock(&io_tree->lock); >> while (!RB_EMPTY_ROOT(&io_tree->state)) { >> struct extent_state *state; >> struct extent_state *cached_state = NULL; >> + u64 start; >> + u64 end; >> >> node = rb_first(&io_tree->state); >> state = rb_entry(node, struct extent_state, rb_node); >> - atomic_inc(&state->refs); >> + start = state->start; >> + end = state->end; >> spin_unlock(&io_tree->lock); >> >> - lock_extent_bits(io_tree, state->start, state->end, >> - 0, &cached_state); >> - clear_extent_bit(io_tree, state->start, state->end, >> + lock_extent_bits(io_tree, start, end, 0, &cached_state); >> + clear_extent_bit(io_tree, start, end, >> EXTENT_LOCKED | EXTENT_DIRTY | >> EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING | >> EXTENT_DEFRAG, 1, 1, >> &cached_state, GFP_NOFS); >> - free_extent_state(state); >> >> cond_resched(); >> spin_lock(&io_tree->lock); >> -- >> 2.1.3 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to [email protected] >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
