On Fri, May 9, 2014 at 11:52 AM, Liu Bo <[email protected]> wrote:
> On Fri, May 09, 2014 at 03:42:01AM +0100, Filipe David Borba Manana wrote:
>> When running low on available disk space and having several processes
>> doing buffered file IO, I got the following trace in dmesg:
>>
>> [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 
>> seconds.
>> [ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
>> [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 
>> 0x00000000
>> [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
>> [ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 
>> 00000000001d3f40
>> [ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 
>> ffff880203ac2490
>> [ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 
>> 0000000000000001
>> [ 4202.720922] Call Trace:
>> [ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
>> [ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 
>> [btrfs]
>> [ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
>> [ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 
>> [btrfs]
>> [ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
>> [ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
>> (...)
>> [ 4202.721027] 2 locks held by kworker/u8:1/5450:
>> [ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] 
>> process_one_work+0x193/0x530
>> [ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: 
>> [<ffffffff81068083>] process_one_work+0x193/0x530
>> [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
>> [ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
>> [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 
>> 0x00000001
>> [ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 
>> 00000000001d3f40
>> [ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 
>> ffff8800a33d2490
>> [ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 
>> 7fffffffffffffff
>> [ 4202.721718] Call Trace:
>> [ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
>> [ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
>> [ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
>> [ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
>> [ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
>> [ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
>> [ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
>> [ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 
>> [btrfs]
>> [ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 
>> [btrfs]
>> [ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
>> [ 4202.721799]  [<ffffffffa02056a9>] 
>> btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
>> [ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 
>> [btrfs]
>> (...)
>>
>> It turns out that extent_io.c:__extent_writepage(), which ends up being 
>> called
>> through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was 
>> getting
>> -ENOSPC when calling the fill_delalloc callback. In this situation, it 
>> returned
>> without the writepage_end_io_hook callback 
>> (inode.c:btrfs_writepage_end_io_hook)
>> ever being called for the respective page, which prevents the ordered 
>> extent's
>> bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
>> is never queued into the endio_write_workers queue. This makes the task that
>> called btrfs_start_ordered_extent() hang forever on the wait queue of the 
>> ordered
>> extent.
>
> Nice fix, I've comfirmed the hang on my box, but things could be more complex,
> more below,
>
>>
>> This is fairly easy to reproduce using a small filesystem and fsstress on
>> a quad core vm:
>>
>>     mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
>>     mount /dev/sdd /mnt
>>
>>     fsstress -p 6 -d /mnt -n 100000 -x \
>>         "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
>>           -f allocsp=0 \
>>           -f bulkstat=0 \
>>           -f bulkstat1=0 \
>>           -f chown=0 \
>>           -f creat=1 \
>>           -f dread=0 \
>>           -f dwrite=0 \
>>           -f fallocate=1 \
>>           -f fdatasync=0 \
>>           -f fiemap=0 \
>>           -f freesp=0 \
>>           -f fsync=0 \
>>           -f getattr=0 \
>>           -f getdents=0 \
>>           -f link=0 \
>>           -f mkdir=0 \
>>           -f mknod=0 \
>>           -f punch=1 \
>>           -f read=0 \
>>           -f readlink=0 \
>>           -f rename=0 \
>>           -f resvsp=0 \
>>           -f rmdir=0 \
>>           -f setxattr=0 \
>>           -f stat=0 \
>>           -f symlink=0 \
>>           -f sync=0 \
>>           -f truncate=1 \
>>           -f unlink=0 \
>>           -f unresvsp=0 \
>>           -f write=4
>>
>> Signed-off-by: Filipe David Borba Manana <[email protected]>
>> ---
>>
>> V2: Grab error from `em` pointer if available, do the error check and
>>     end_extent_writepage call before unlocking the page (just like
>>     end_bio_extent_writepage does).
>>
>>  fs/btrfs/extent_io.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 0c43896..b0c73ad 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3284,6 +3284,7 @@ static int __extent_writepage(struct page *page, 
>> struct writeback_control *wbc,
>>                                    end - cur + 1, 1);
>>               if (IS_ERR_OR_NULL(em)) {
>>                       SetPageError(page);
>> +                     ret = PTR_ERR_OR_ZERO(em);
>>                       break;
>>               }
>>
>> @@ -3370,13 +3371,17 @@ done:
>>               set_page_writeback(page);
>>               end_page_writeback(page);
>>       }
>> +     if (PageError(page)) {
>> +             ret = ret < 0 ? ret : -EIO;
>> +             end_extent_writepage(page, ret, start, page_end);
>> +     }
>>       unlock_page(page);
>>
>>  done_unlocked:
>>
>>       /* drop our reference on any cached states */
>>       free_extent_state(cached_state);
>> -     return 0;
>> +     return ret;
>
> So 'return ret' can makes extent_write_cache_pages() stop after this
> __extent_writepage, and if the page happens to be the last one of the current
> pagevec of the for() loop that extent_write_cache_pages() is using, the rest
> pages will never get be processed.
>
> Try to image this, on lack of space, we'd get ENOSPC on btrfs_reserve_extent()
> which is in cow_file_range() -- a normal choice for run_delalloc_range(), and
> for page N, it's the locked_page we use in cow_file_range(), and the delalloc
> range also involves other pages that are after page N.  Now we get ENOSPC on
> allocating space for page N+2 rather the page N, which means page N and page 
> N+1
> have been on the ordered list, then as this patch does, we bail out and return
> the -ENOSPC error to caller, now if the above situation occurs, page N+1 will 
> not
> be processed, then the page N+1 's ordered extent will not be finished either
> since we don't submit that page at all.
>
> However, this is a rare case, I don't run into that(I guess that's because it
> has low possibility that the page happens to be the last one of pagevec).
>
> So my suggestion is, let this locked page go for the normal submit_bio path if
> it has been set up properly for ordered write.  Those pages that don't get 
> space
> in cow_file_range() has been cleaned up through the error handling path, but
> what's lacked here is that these pages are not marked via SetPageError, we 
> might
> need to add that in extent_clear_unlock_delalloc().
>
> (but I could be wrong, so cc maintainers for ideas)

I think you're right.
I'm testing this incremental diff on top:
https://friendpaste.com/6eWAVqAjsqc6ZxpEJpvLH5
Will integrate it after finishing some tests.

Thanks Liu.

>
> thanks,
> -liubo
>
>>  }
>>
>>  static int eb_wait(void *word)
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to [email protected]
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to