On Thu, Sep 12, 2019 at 02:19:55PM +0100, Filipe Manana wrote:
> On Thu, Sep 12, 2019 at 1:18 PM Josef Bacik <jo...@toxicpanda.com> wrote:
> >
> > On Tue, Sep 10, 2019 at 03:26:49PM +0100, fdman...@kernel.org wrote:
> > > From: Filipe Manana <fdman...@suse.com>
> > >
> > > Sometimes when fsync'ing a file we need to log that other inodes exist and
> > > when we need to do that we acquire a reference on the inodes and then drop
> > > that reference using iput() after logging them.
> > >
> > > That generally is not a problem except if we end up doing the final iput()
> > > (dropping the last reference) on the inode and that inode has a link count
> > > of 0, which can happen in a very short time window if the logging path
> > > gets a reference on the inode while it's being unlinked.
> > >
> > > In that case we end up getting the eviction callback, btrfs_evict_inode(),
> > > invoked through the iput() call chain which needs to drop all of the
> > > inode's items from its subvolume btree, and in order to do that, it needs
> > > to join a transaction at the helper function evict_refill_and_join().
> > > However because the task previously started a transaction at the fsync
> > > handler, btrfs_sync_file(), it has current->journal_info already pointing
> > > to a transaction handle and therefore evict_refill_and_join() will get
> > > that transaction handle from btrfs_join_transaction(). From this point on,
> > > two different problems can happen:
> > >
> > > 1) evict_refill_and_join() will often change the transaction handle's
> > >    block reserve (->block_rsv) and set its ->bytes_reserved field to a
> > >    value greater than 0. If evict_refill_and_join() never commits the
> > >    transaction, the eviction handler ends up decreasing the reference
> > >    count (->use_count) of the transaction handle through the call to
> > >    btrfs_end_transaction(), and after that point we have a transaction
> > >    handle with a NULL ->block_rsv (which is the value prior to the
> > >    transaction join from evict_refill_and_join()) and a ->bytes_reserved
> > >    value greater than 0. If after the eviction/iput completes the inode
> > >    logging path hits an error or it decides that it must fallback to a
> > >    transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
> > >    non-zero value from btrfs_log_dentry_safe(), and because of that
> > >    non-zero value it tries to commit the transaction using a handle with
> > >    a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
> > >    the transaction commit hit an assertion failure at
> > >    btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
> > >    the ->block_rsv is NULL. The produced stack trace for that is like the
> > >    following:
> > >
> > >    [192922.917158] assertion failed: !trans->bytes_reserved, file: 
> > > fs/btrfs/transaction.c, line: 816
> > >    [192922.917553] ------------[ cut here ]------------
> > >    [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
> > >    [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> > >    [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W     
> > >     5.1.4-btrfs-next-47 #1
> > >    [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> > >    [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
> > >    (...)
> > >    [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
> > >    [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 
> > > 0000000000000000
> > >    [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: 
> > > ffff95cab6b16838
> > >    [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 
> > > 0000000000000000
> > >    [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: 
> > > ffff95ca731a0980
> > >    [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: 
> > > ffff95ca731a0ea8
> > >    [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) 
> > > knlGS:0000000000000000
> > >    [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >    [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 
> > > 00000000003606e0
> > >    [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> > > 0000000000000000
> > >    [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> > > 0000000000000400
> > >    [192922.925105] Call Trace:
> > >    [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
> > >    [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
> > >    [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
> > >    [192922.926731]  do_fsync+0x38/0x60
> > >    [192922.927138]  __x64_sys_fdatasync+0x13/0x20
> > >    [192922.927543]  do_syscall_64+0x60/0x1c0
> > >    [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >    (...)
> > >    [192922.934077] ---[ end trace f00808b12068168f ]---
> > >
> > > 2) If evict_refill_and_join() decides to commit the transaction, it will
> > >    be able to do it, since the nested transaction join only increments the
> > >    transaction handle's ->use_count reference counter and it does not
> > >    prevent the transaction from getting committed. This means that after
> > >    eviction completes, the fsync logging path will be using a transaction
> > >    handle that refers to an already committed transaction. What happens
> > >    when using such a stale transaction can be unpredictable, we are at
> > >    least having a use-after-free on the transaction handle itself, since
> > >    the transaction commit will call kmem_cache_free() against the handle
> > >    regardless of its ->use_count value, or we can end up silently losing
> > >    all the updates to the log tree after that iput() in the logging path,
> > >    or using a transaction handle that in the meanwhile was allocated to
> > >    another task for a new transaction, etc, pretty much unpredictable
> > >    what can happen.
> > >
> >
> > And talking it over with Nikolay I realized that since we're doing the 
> > commit
> > through the flushing this doesn't actually happen anymore.
> > may_commit_transaction() returns -EGAIN if we already have a trans handle 
> > open,
> > so hooray I made it safer by accident!  But we definitely should follow up 
> > and
> > add an assert in btrfs_commit_transaction() to catch this case, cause holy 
> > shit
> > that's bad.  Thanks,
> 
> Yep, but problem 1) (assertion failure) is still valid after your
> recent changes in misc-next.
> Either way, this fix has to be added to stable releases to fix both
> problems there.
> 

Agreed, just adding context for the current changes.  This patch is fine as it
is, just some follow up stuff to catch transaction commits under nested trans
handles.  Thanks,

Josef

Reply via email to