On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote:
> On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <[email protected]> wrote:
> > On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
> >> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> >> > [BUG]
> >> > When btrfs_reloc_clone_csum() reports an error, it can underflow
> >> > metadata and lead to a kernel assertion on outstanding extents in
> >> > run_delalloc_nocow() and cow_file_range().
> >> >
> >> > BTRFS info (device vdb5): relocating block group 12582912 flags data
> >> > BTRFS info (device vdb5): found 1 extents
> >> > assertion failed: inode->outstanding_extents >= num_extents, file:
> >> > fs/btrfs//extent-tree.c, line: 5858
> >> >
> >> > Currently, due to another bug blocking ordered extents, the bug is only
> >> > reproducible under a certain block group layout and using error
> >> > injection:
> >> >
> >> > a) Create one data block group with one 4K extent in it,
> >> >    to avoid the bug that hangs btrfs due to an ordered extent which
> >> >    never finishes.
> >> > b) Make btrfs_reloc_clone_csum() always fail.
> >> > c) Relocate that block group.
> >> >
> >> > [CAUSE]
> >> > run_delalloc_nocow() and cow_file_range() handle errors from
> >> > btrfs_reloc_clone_csum() wrongly:
> >> >
> >> > (The ASCII chart shows a more generic case of this bug than the one
> >> > mentioned above.)
> >> >
> >> > |<------------------ delalloc range --------------------------->|
> >> > | OE 1 | OE 2 | ... | OE n |
> >> > |<----------- cleanup range --------------->|
> >> >                    |<-----------  ----------->|
> >> >                                 \/
> >> >                    btrfs_finish_ordered_io() range
> >> >
> >> > So the error handler, which calls extent_clear_unlock_delalloc() with
> >> > the EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and
> >> > btrfs_finish_ordered_io() will both cover OE n and free its metadata,
> >> > causing a metadata underflow.
> >> >
> >> > [Fix]
> >> > The fix is to ensure that, after calling btrfs_add_ordered_extent(),
> >> > we only call the error handler after increasing the iteration offset,
> >> > so that the cleanup range won't cover any created ordered extent:
> >> >
> >> > |<------------------ delalloc range --------------------------->|
> >> > | OE 1 | OE 2 | ... | OE n |
> >> > |<-----------  ----------->|<---------- cleanup range --------->|
> >> >              \/
> >> > btrfs_finish_ordered_io() range
> >> >
> >> > Signed-off-by: Qu Wenruo <[email protected]>
> >> > ---
> >> > changelog:
> >> > v6:
> >> >   New, split from v5 patch, as this is a separate bug.
> >> > ---
> >> >  fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
> >> >  1 file changed, 39 insertions(+), 12 deletions(-)
> >> >
> >> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >> > index b2bc07aad1ae..1d83d504f2e5 100644
> >> > --- a/fs/btrfs/inode.c
> >> > +++ b/fs/btrfs/inode.c
> >> > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
> >> >  				BTRFS_DATA_RELOC_TREE_OBJECTID) {
> >> >  			ret = btrfs_reloc_clone_csums(inode, start,
> >> >  						      cur_alloc_size);
> >> > +			/*
> >> > +			 * Only drop cache here, and process as normal.
> >> > +			 *
> >> > +			 * We must not allow extent_clear_unlock_delalloc()
> >> > +			 * at out_unlock label to free meta of this ordered
> >> > +			 * extent, as its meta should be freed by
> >> > +			 * btrfs_finish_ordered_io().
> >> > +			 *
> >> > +			 * So we must continue until @start is increased to
> >> > +			 * skip current ordered extent.
> >> > +			 */
> >> >  			if (ret)
> >> > -				goto out_drop_extent_cache;
> >> > +				btrfs_drop_extent_cache(BTRFS_I(inode), start,
> >> > +						start + ram_size - 1, 0);
> >> >  		}
> >> >
> >> >  		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> >> >
> >> > -		if (disk_num_bytes < cur_alloc_size)
> >> > -			break;
> >> > -
> >> >  		/* we're not doing compressed IO, don't unlock the first
> >> >  		 * page (which the caller expects to stay locked), don't
> >> >  		 * clear any dirty bits and don't set any writeback bits
> >> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
> >> >  					     delalloc_end, locked_page,
> >> >  					     EXTENT_LOCKED | EXTENT_DELALLOC,
> >> >  					     op);
> >> > -		disk_num_bytes -= cur_alloc_size;
> >> > +		if (disk_num_bytes > cur_alloc_size)
> >> > +			disk_num_bytes = 0;
> >> > +		else
> >> > +			disk_num_bytes -= cur_alloc_size;
> >>
> >> I don't get the logic here, why do we 'break' if disk_num_bytes >
> >> cur_alloc_size?
> >
> > I assume that you've run fstests against this patch; if so, I actually
> > start worrying that no fstests found this problem.
>
> The operator is definitely wrong, should be < instead of > (previous
> patch versions were correct).
>
> It's not a surprise fstests don't catch this - one would need a fs
> with no more free space to allocate new data chunks, and existing data
> chunks would have to be fragmented to the point we need to allocate
> two or more smaller extents to satisfy the write, so whether fstests
> exercises this depends on the size of the test devices (mine are 100G
> for example) and the size of the data the tests create. Having a
> deterministic test for that should be possible and useful.
I see, I think that 'mount -o fragment=data' could help us get it.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
