On Wed, Oct 24, 2018 at 1:40 PM Josef Bacik <jo...@toxicpanda.com> wrote: > > On Wed, Oct 24, 2018 at 12:53:59PM +0100, Filipe Manana wrote: > > On Wed, Oct 24, 2018 at 12:37 PM Josef Bacik <jo...@toxicpanda.com> wrote: > > > > > > On Wed, Oct 24, 2018 at 10:13:03AM +0100, fdman...@kernel.org wrote: > > > > From: Filipe Manana <fdman...@suse.com> > > > > > > > > When we are writing out a free space cache, during the transaction > > > > commit > > > > phase, we can end up in a deadlock which results in a stack trace like > > > > the > > > > following: > > > > > > > > schedule+0x28/0x80 > > > > btrfs_tree_read_lock+0x8e/0x120 [btrfs] > > > > ? finish_wait+0x80/0x80 > > > > btrfs_read_lock_root_node+0x2f/0x40 [btrfs] > > > > btrfs_search_slot+0xf6/0x9f0 [btrfs] > > > > ? evict_refill_and_join+0xd0/0xd0 [btrfs] > > > > ? inode_insert5+0x119/0x190 > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > btrfs_iget+0x113/0x690 [btrfs] > > > > __lookup_free_space_inode+0xd8/0x150 [btrfs] > > > > lookup_free_space_inode+0x5b/0xb0 [btrfs] > > > > load_free_space_cache+0x7c/0x170 [btrfs] > > > > ? cache_block_group+0x72/0x3b0 [btrfs] > > > > cache_block_group+0x1b3/0x3b0 [btrfs] > > > > ? finish_wait+0x80/0x80 > > > > find_free_extent+0x799/0x1010 [btrfs] > > > > btrfs_reserve_extent+0x9b/0x180 [btrfs] > > > > btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs] > > > > __btrfs_cow_block+0x11d/0x500 [btrfs] > > > > btrfs_cow_block+0xdc/0x180 [btrfs] > > > > btrfs_search_slot+0x3bd/0x9f0 [btrfs] > > > > btrfs_lookup_inode+0x3a/0xc0 [btrfs] > > > > ? kmem_cache_alloc+0x166/0x1d0 > > > > btrfs_update_inode_item+0x46/0x100 [btrfs] > > > > cache_save_setup+0xe4/0x3a0 [btrfs] > > > > btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs] > > > > btrfs_commit_transaction+0xcb/0x8b0 [btrfs] > > > > > > > > At cache_save_setup() we need to update the inode item of a block > > > > group's > > > > cache which is located in the tree root (fs_info->tree_root), which > > > > means > > > > that it may result in COWing a leaf from that tree. If that happens we > > > > need to find a free metadata extent and while looking for one, if we > > > > find > > > > a block group which was not cached yet we attempt to load its cache by > > > > calling cache_block_group(). However this function will try to load the > > > > inode of the free space cache, which requires finding the matching inode > > > > item in the tree root - if that inode item is located in the same leaf > > > > as > > > > the inode item of the space cache we are updating at cache_save_setup(), > > > > we end up in a deadlock, since we try to obtain a read lock on the same > > > > extent buffer that we previously write locked. > > > > > > > > So fix this by using the tree root's commit root when searching for a > > > > block group's free space cache inode item when we are attempting to load > > > > a free space cache. This is safe since block groups once loaded stay in > > > > memory forever, as well as their caches, so after they are first loaded > > > > we will never need to read their inode items again. For new block > > > > groups, > > > > once they are created they get their ->cached field set to > > > > BTRFS_CACHE_FINISHED meaning we will not need to read their inode item. > > > > > > > > Reported-by: Andrew Nelson <andrew.s.nel...@gmail.com> > > > > Link: > > > > https://lore.kernel.org/linux-btrfs/captelenq9x5kowuq+fa7h1r3nsjg8vyith8+ifjurc_duhh...@mail.gmail.com/ > > > > Fixes: 9d66e233c704 ("Btrfs: load free space cache if it exists") > > > > Tested-by: Andrew Nelson <andrew.s.nel...@gmail.com> > > > > Signed-off-by: Filipe Manana <fdman...@suse.com> > > > > --- > > > > > > > > > > Now my goal is to see how many times I can get you to redo this thing. > > > > > > Why not instead just do > > > > > > if (btrfs_is_free_space_inode(inode)) > > > path->search_commit_root = 1; > > > > > > in read_locked_inode? That would be cleaner. If we don't want to do > > > that for > > > the inode cache (I'm not sure if it's ok or not) we could just do > > > > > > if (root == fs_info->tree_root) > > > > We can't (not just that at least). > > Tried something like that, but we get into a BUG_ON when writing out > > the space cache for new block groups (created in the current > > transaction). > > Because at cache_save_setup() we have this: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c?h=v4.19#n3342 > > > > Lookup for the inode in normal root, doesn't exist, create it then > > repeat - if still not found, BUG_ON. > > Could also make create_free_space_inode() return an inode pointer and > > make it call btrfs_iget(). > > > > Ah ok makes sense. Well in that case lets just make btrfs_read_locked_inode() > take a path, and allocate it in btrfs_iget, that'll remove the ugly > > if (path != in_path)
You mean the following on top of v4: https://friendpaste.com/6XrGXb5p0RSJGixUFYouHg Not much different, just saves one such if statement. I'm ok with that. > > stuff. Thanks, > > Josef