On Wed, Oct 31, 2018 at 06:40:29PM +0200, Nikolay Borisov wrote:
> 
> 
> On 31.10.18 г. 18:35 ч., Omar Sandoval wrote:
> > On Wed, Oct 31, 2018 at 10:43:02AM +0200, Nikolay Borisov wrote:
> >>
> >>
> >> On 31.10.18 г. 2:14 ч., Omar Sandoval wrote:
> >>> From: Omar Sandoval <osan...@fb.com>
> >>>
> >>> There's a race between close_ctree() and cleaner_kthread().
> >>> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> >>> sees it set, but this is racy; the cleaner might have already checked
> >>> the bit and could be cleaning stuff. In particular, if it deletes
> >>> unused block groups, it will create delayed iputs for the free space
> >>> cache inodes. As of "btrfs: don't run delayed_iputs in commit", we're
> >>> no longer running delayed iputs after a commit. Therefore, if the
> >>> cleaner creates more delayed iputs after delayed iputs are run in
> >>> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> >>> inode crash from the VFS.
> >>>
> >>> Fix it by parking the cleaner before we actually close anything.
> >>> Then, any remaining delayed iputs will always be handled in
> >>> btrfs_commit_super(). This also ensures that the commit in
> >>> close_ctree() is really the last commit, so we can get rid of the
> >>> commit in cleaner_kthread().
> >>>
> >>> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> >>> Signed-off-by: Omar Sandoval <osan...@fb.com>
> >>> ---
> >>> We found this with a stress test that our containers team runs. I'm
> >>> wondering if this same race could have caused any other issues other
> >>> than this new iput thing, but I couldn't identify any.
> >>>
> >>>  fs/btrfs/disk-io.c | 40 +++++++---------------------------------
> >>>  1 file changed, 7 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index b0ab41da91d1..7c17284ae3c2 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
> >>>  	struct btrfs_root *root = arg;
> >>>  	struct btrfs_fs_info *fs_info = root->fs_info;
> >>>  	int again;
> >>> -	struct btrfs_trans_handle *trans;
> >>>  
> >>> -	do {
> >>> +	while (1) {
> >>>  		again = 0;
> >>>  
> >>>  		/* Make the cleaner go to sleep early. */
> >>> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
> >>>  		 */
> >>>  		btrfs_delete_unused_bgs(fs_info);
> >>>  sleep:
> >>> +		if (kthread_should_park())
> >>> +			kthread_parkme();
> >>> +		if (kthread_should_stop())
> >>> +			return 0;
> >>>  		if (!again) {
> >>>  			set_current_state(TASK_INTERRUPTIBLE);
> >>> -			if (!kthread_should_stop())
> >>> -				schedule();
> >>> +			schedule();
> >>>  			__set_current_state(TASK_RUNNING);
> >>>  		}
> >>> -	} while (!kthread_should_stop());
> >>> -
> >>> -	/*
> >>> -	 * Transaction kthread is stopped before us and wakes us up.
> >>> -	 * However we might have started a new transaction and COWed some
> >>> -	 * tree blocks when deleting unused block groups for example. So
> >>> -	 * make sure we commit the transaction we started to have a clean
> >>> -	 * shutdown when evicting the btree inode - if it has dirty pages
> >>> -	 * when we do the final iput() on it, eviction will trigger a
> >>> -	 * writeback for it which will fail with null pointer dereferences
> >>> -	 * since work queues and other resources were already released and
> >>> -	 * destroyed by the time the iput/eviction/writeback is made.
> >>> -	 */
> >>> -	trans = btrfs_attach_transaction(root);
> >>> -	if (IS_ERR(trans)) {
> >>> -		if (PTR_ERR(trans) != -ENOENT)
> >>> -			btrfs_err(fs_info,
> >>> -				  "cleaner transaction attach returned %ld",
> >>> -				  PTR_ERR(trans));
> >>> -	} else {
> >>> -		int ret;
> >>> -
> >>> -		ret = btrfs_commit_transaction(trans);
> >>> -		if (ret)
> >>> -			btrfs_err(fs_info,
> >>> -				  "cleaner open transaction commit returned %d",
> >>> -				  ret);
> >>> -	}
> >>> -
> >>> -	return 0;
> >>>  }
> >>>  
> >>>  static int transaction_kthread(void *arg)
> >>> @@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
> >>>  	int ret;
> >>>  
> >>>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> >>> +	kthread_park(fs_info->cleaner_kthread);
> >>
> >> Can't you directly call kthread_stop here? When you park the thread
> >> it will sleep, and then when you call kthread_stop that function will
> >> unpark the thread and the cleaner kthread will see the
> >> KTHREAD_SHOULD_STOP bit and just return 0. So from the moment the
> >> thread is parked until it's stopped it doesn't have a chance to do
> >> useful work.
> > 
> > kthread_stop() frees the task_struct, but we might still try to wake
> > up the cleaner kthread from somewhere (e.g., from the transaction
> > kthread). So we really need to keep the cleaner alive but not doing
> > work.
> 
> This dependency then needs to be documented via a comment or at the
> very least mentioned in the changelog. Is it possible to refactor the
> code (in a different patch) to actually ensure that the transaction
> kthread is stopped and then the cleaner kthread as well, to remove this
> dependency?
Then we get the same issue with things trying to wake up the transaction
kthread after it's been freed (__btrfs_end_transaction(), in
particular). Maybe we could make it work, but it seems very fragile.
I'll send a v2 with a comment.