On Fri, May 09, 2014 at 03:42:01AM +0100, Filipe David Borba Manana wrote:
> When running low on available disk space and having several processes
> doing buffered file IO, I got the following trace in dmesg:
>
> [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
> [ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
> [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2
> 0x00000000
> [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
> [ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490
> 00000000001d3f40
> [ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920
> ffff880203ac2490
> [ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8
> 0000000000000001
> [ 4202.720922] Call Trace:
> [ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70
> [ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110
> [btrfs]
> [ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
> [ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40
> [btrfs]
> [ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
> [ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530
> (...)
> [ 4202.721027] 2 locks held by kworker/u8:1/5450:
> [ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>]
> process_one_work+0x193/0x530
> [ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>]
> process_one_work+0x193/0x530
> [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
> [ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
> [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 4202.721699] btrfs D 0000000000000001 0 7891 7890
> 0x00000001
> [ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490
> 00000000001d3f40
> [ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000
> ffff8800a33d2490
> [ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8
> 7fffffffffffffff
> [ 4202.721718] Call Trace:
> [ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70
> [ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
> [ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
> [ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
> [ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
> [ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
> [ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
> [ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280
> [btrfs]
> [ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0
> [btrfs]
> [ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
> [ 4202.721799] [<ffffffffa02056a9>]
> btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
> [ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170
> [btrfs]
> (...)
>
> It turns out that extent_io.c:__extent_writepage(), which ends up being called
> through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
> -ENOSPC when calling the fill_delalloc callback. In this situation, it returned
> without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
> ever being called for the respective page, which prevents the ordered extent's
> bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
> is never queued into the endio_write_workers queue. This makes the task that
> called btrfs_start_ordered_extent() hang forever on the wait queue of the
> ordered extent.
Nice fix, I've confirmed the hang on my box, but things could be more complex;
more below.
>
> This is fairly easy to reproduce using a small filesystem and fsstress on
> a quad core vm:
>
> mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
> mount /dev/sdd /mnt
>
> fsstress -p 6 -d /mnt -n 100000 -x \
> "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
> -f allocsp=0 \
> -f bulkstat=0 \
> -f bulkstat1=0 \
> -f chown=0 \
> -f creat=1 \
> -f dread=0 \
> -f dwrite=0 \
> -f fallocate=1 \
> -f fdatasync=0 \
> -f fiemap=0 \
> -f freesp=0 \
> -f fsync=0 \
> -f getattr=0 \
> -f getdents=0 \
> -f link=0 \
> -f mkdir=0 \
> -f mknod=0 \
> -f punch=1 \
> -f read=0 \
> -f readlink=0 \
> -f rename=0 \
> -f resvsp=0 \
> -f rmdir=0 \
> -f setxattr=0 \
> -f stat=0 \
> -f symlink=0 \
> -f sync=0 \
> -f truncate=1 \
> -f unlink=0 \
> -f unresvsp=0 \
> -f write=4
>
> Signed-off-by: Filipe David Borba Manana <[email protected]>
> ---
>
> V2: Grab error from `em` pointer if available, do the error check and
> end_extent_writepage call before unlocking the page (just like
> end_bio_extent_writepage does).
>
> fs/btrfs/extent_io.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0c43896..b0c73ad 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3284,6 +3284,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
> end - cur + 1, 1);
> if (IS_ERR_OR_NULL(em)) {
> SetPageError(page);
> + ret = PTR_ERR_OR_ZERO(em);
> break;
> }
>
> @@ -3370,13 +3371,17 @@ done:
> set_page_writeback(page);
> end_page_writeback(page);
> }
> + if (PageError(page)) {
> + ret = ret < 0 ? ret : -EIO;
> + end_extent_writepage(page, ret, start, page_end);
> + }
> unlock_page(page);
>
> done_unlocked:
>
> /* drop our reference on any cached states */
> free_extent_state(cached_state);
> - return 0;
> + return ret;
So 'return ret' can make extent_write_cache_pages() stop after this
__extent_writepage(), and if the page happens to be the last one of the current
pagevec of the for() loop that extent_write_cache_pages() is using, the
remaining pages will never be processed.
Try to imagine this: on lack of space, we'd get ENOSPC from
btrfs_reserve_extent(), which is called in cow_file_range() -- a normal choice
for run_delalloc_range(). Page N is the locked_page we use in cow_file_range(),
and the delalloc range also involves other pages that come after page N. Now
suppose we get ENOSPC while allocating space for page N+2 rather than page N,
which means page N and page N+1 have already been put on the ordered list; then,
as this patch does, we bail out and return the -ENOSPC error to the caller. If
the above situation occurs, page N+1 will not be processed, and page N+1's
ordered extent will not be finished either, since we don't submit that page
at all.
However, this is a rare case and I haven't run into it (I guess it's unlikely
that the failing page happens to be the last one of a pagevec).
So my suggestion is to let this locked page go down the normal submit_bio path
if it has been set up properly for ordered write. The pages that don't get
space in cow_file_range() have already been cleaned up through the error
handling path, but what's lacking is that these pages are not marked via
SetPageError; we might need to add that in extent_clear_unlock_delalloc().
(But I could be wrong, so cc maintainers for ideas.)
thanks,
-liubo
> }
>
> static int eb_wait(void *word)
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html