On Thu, Dec 17, 2020 at 5:03 AM Qu Wenruo <[email protected]> wrote:
>
> [BUG]
> With current subpage RW patchset, the following script can lead to
> filesystem hang:
>   # mkfs.btrfs -f -s 4k $dev
>   # mount $dev -o nospace_cache $mnt
>   # fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
>
> The file system will hang at wait_event() of
> btrfs_start_ordered_extent().
>
> [CAUSE]
> The root cause is, btrfs_invalidatepage() is freeing page::private which
> still has subpage dirty bit set.
>
> The offending situation happens like this:
> btrfs_fllocate()
> |- btrfs_zero_range()
>    |- btrfs_punch_hole_lock_range()
>       |- truncate_pagecache_range()
>          |- btrfs_invalidatepage()
>
> The involved range looks like:
>
> 0       32K     64K     96K     128K
>         |///////||//////|
>         | Range to drop |
>
> For the [32K, 64K) range, since the offset is 32K, the page won't be
> invalidated.
>
> But for the [64K, 96K) range, the offset is 0, current
> btrfs_invalidatepage() will call clear_page_extent_mapped() which will
> detach page::private, making the subpage dirty bitmap being cleared.
>
> This prevents later __extent_writepage_io() to locate any range to
> write, thus no way to wake up the ordered extents.
>
> [FIX]
> To fix the problem this patch will:
> - Only clear page status and detach page private when the full page
>   is invalidated
>
> - Change how we handle unfinished ordered extent
>   If there is any ordered extent unfinished in the page range, we can't
>   call clear_extent_bit() with delete == true.
>
> [REASON FOR RFC]
> There is still uncertainty around the btrfs_releasepage() call.
>
> 1. Why we need btrfs_releasepage() call for non-full-page condition?
>    Other fs (aka. xfs) just exit without doing special handling if
>    invalidatepage() is called with part of the page.
>
>    Thus I didn't completely understand why btrfs_releasepage() here is
>    needed for non-full page call.
>
> 2. Why "if (offset)" is not causing problem for current code?
>    This existing if (offset) call can be skipped for cases like
>    offset == 0 length == 2K.
>    As MM layer can call invalidatepage() with unaligned offset/length,
>    for cases like truncate_inode_pages_range().
>    This will make btrfs_invalidatepage() to truncate the whole page when
>    we only need to zero part of the page.

I don't think you can ever get offset == 0 and length < PAGE_SIZE
unless this is the last page in the file, the one containing eof, in
which it's perfectly valid to invalidate the whole page.

>
> Signed-off-by: Qu Wenruo <[email protected]>
> ---
>  fs/btrfs/inode.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index eb493fbb65f9..872c5309b4ca 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8180,7 +8180,7 @@ static void btrfs_invalidatepage(struct page *page, 
> unsigned int offset,
>         int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
>         bool cleared_private2;
>         bool found_ordered = false;
> -       bool completed_ordered = false;
> +       bool incompleted_ordered = false;
>
>         /*
>          * we have the page locked, so new writeback can't start,
> @@ -8191,7 +8191,13 @@ static void btrfs_invalidatepage(struct page *page, 
> unsigned int offset,
>          */
>         wait_on_page_writeback(page);
>
> -       if (offset) {
> +       /*
> +        * The range doesn't cover the full page, just let 
> btrfs_releasepage() to
> +        * check if we can release the extent mapping.
> +        * Any locked/pinned/logged extent map would prevent us freeing the
> +        * extent mapping.
> +        */
> +       if (!(offset == 0 && length == PAGE_SIZE)) {
>                 btrfs_releasepage(page, GFP_NOFS);
>                 return;
>         }
> @@ -8208,9 +8214,10 @@ static void btrfs_invalidatepage(struct page *page, 
> unsigned int offset,
>                 end = min(page_end,
>                           ordered->file_offset + ordered->num_bytes - 1);
>                 /*
> -                * IO on this page will never be started, so we need to 
> account
> -                * for any ordered extents now. Don't clear 
> EXTENT_DELALLOC_NEW
> -                * here, must leave that up for the ordered extent completion.
> +                * IO on this ordered extent will never be started, so we need
> +                * to account for any ordered extents now. Don't clear

So this comment update states that "IO on this ordered extent will
never be started".
That is not true, unless some other patch not in misc-next changed
something and I missed it (like splitting ordered extents).

If you have a 1M ordered extent, for file range [0, 1M[ for example,
and then truncate the file to 512K or punch a hole for the range
[512K, 1M[, then IO for the first 512K of the ordered extent is still
done.

So I think what you wanted to say is more like "IO for this subpage
will never be started ...".

> +                * EXTENT_DELALLOC_NEW here, must leave that up for the
> +                * ordered extent completion.
>                  */
>                 if (!inode_evicting)
>                         clear_extent_bit(tree, start, end,
> @@ -8234,7 +8241,8 @@ static void btrfs_invalidatepage(struct page *page, 
> unsigned int offset,
>                                                            start,
>                                                            end - start + 1, 
> 1)) {
>                                 btrfs_finish_ordered_io(ordered);
> -                               completed_ordered = true;
> +                       } else {
> +                               incompleted_ordered = true;
>                         }
>                 }
>
> @@ -8276,7 +8284,7 @@ static void btrfs_invalidatepage(struct page *page, 
> unsigned int offset,
>                  * is cleared if we don't delete, otherwise it can lead to
>                  * corruptions if the i_size is extented later.
>                  */
> -               if (found_ordered && !completed_ordered)
> +               if (found_ordered && incompleted_ordered)

I find this naming, "incompleted_ordered" confusing, I think
"incompleted" is not even a valid english word.

What you mean is that if there is some ordered extent for the page
range that we could not complete ourselves.
I would suggest naming it to "completed_all_ordered", initialize it to
true and then set it to false when we can't complete an ordered extent
ourselves.

Then it would just be "if (found_ordered && !completed_all_ordered)
delete = false;".

Also, I haven't checked the other patchsets for subpage support, but
from looking only at this patchset, I'm assuming we can't set ranges
in the io tree smaller than page size, is that correct?
Otherwise this would be calling clear_extent_bit for each subpage range.

Thanks.

>                         delete = false;
>                 clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
>                                  EXTENT_DELALLOC | EXTENT_UPTODATE |
> @@ -8286,6 +8294,7 @@ static void btrfs_invalidatepage(struct page *page, 
> unsigned int offset,
>                 __btrfs_releasepage(page, GFP_NOFS);
>         }
>
> +       ClearPagePrivate2(page);
>         ClearPageChecked(page);
>         clear_page_extent_mapped(page);
>  }
> --
> 2.29.2
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

Reply via email to