On 2020/12/17 下午7:20, Filipe Manana wrote:
On Thu, Dec 17, 2020 at 5:03 AM Qu Wenruo <[email protected]> wrote:

[BUG]
With current subpage RW patchset, the following script can lead to
filesystem hang:
   # mkfs.btrfs -f -s 4k $dev
   # mount $dev -o nospace_cache $mnt
   # fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The file system will hang at wait_event() of
btrfs_start_ordered_extent().

[CAUSE]
The root cause is, btrfs_invalidatepage() is freeing page::private which
still has subpage dirty bit set.

The offending situation happens like this:
btrfs_fllocate()
|- btrfs_zero_range()
    |- btrfs_punch_hole_lock_range()
       |- truncate_pagecache_range()
          |- btrfs_invalidatepage()

The involved range looks like:

0       32K     64K     96K     128K
         |///////||//////|
         | Range to drop |

For the [32K, 64K) range, since the offset is 32K, the page won't be
invalidated.

But for the [64K, 96K) range, the offset is 0, current
btrfs_invalidatepage() will call clear_page_extent_mapped() which will
detach page::private, making the subpage dirty bitmap being cleared.

This prevents later __extent_writepage_io() to locate any range to
write, thus no way to wake up the ordered extents.

[FIX]
To fix the problem this patch will:
- Only clear page status and detach page private when the full page
   is invalidated

- Change how we handle unfinished ordered extent
   If there is any ordered extent unfinished in the page range, we can't
   call clear_extent_bit() with delete == true.

[REASON FOR RFC]
There is still uncertainty around the btrfs_releasepage() call.

1. Why we need btrfs_releasepage() call for non-full-page condition?
    Other fs (aka. xfs) just exit without doing special handling if
    invalidatepage() is called with part of the page.

    Thus I didn't completely understand why btrfs_releasepage() here is
    needed for non-full page call.

2. Why "if (offset)" is not causing problem for current code?
    This existing if (offset) call can be skipped for cases like
    offset == 0 length == 2K.
    As MM layer can call invalidatepage() with unaligned offset/length,
    for cases like truncate_inode_pages_range().
    This will make btrfs_invalidatepage() to truncate the whole page when
    we only need to zero part of the page.

I don't think you can ever get offset == 0 and length < PAGE_SIZE
unless this is the last page in the file, the one containing eof, in
which it's perfectly valid to invalidate the whole page.

You're right.

After more testing, it indeed shows except setsize call which could pass
unaligned range, all other call sites are using sector aligned ranges.

Thus the existing code won't cause any problem for current code base.

Either we're invalidating the last page of the inode, or we're
invalidating sector (PAGE) aligned range.


But for subpage support, we can have cases like
btrfs_punch_hole_lock_range() which only passes sector aligned range in,
and since sector size is smaller than page size, we can have offset == 0
while length < PAGE_SIZE and it's not the last page.

Further more, the page can have dirty range not covered by the
invalidatepage range, causing the problem.

I'll update the commit message to explain the case more, and only put
the fix into the subpage series, other than sending it out without
subpage context.



Signed-off-by: Qu Wenruo <[email protected]>
---
  fs/btrfs/inode.c | 23 ++++++++++++++++-------
  1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eb493fbb65f9..872c5309b4ca 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8180,7 +8180,7 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
         int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
         bool cleared_private2;
         bool found_ordered = false;
-       bool completed_ordered = false;
+       bool incompleted_ordered = false;

         /*
          * we have the page locked, so new writeback can't start,
@@ -8191,7 +8191,13 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
          */
         wait_on_page_writeback(page);

-       if (offset) {
+       /*
+        * The range doesn't cover the full page, just let btrfs_releasepage() 
to
+        * check if we can release the extent mapping.
+        * Any locked/pinned/logged extent map would prevent us freeing the
+        * extent mapping.
+        */
+       if (!(offset == 0 && length == PAGE_SIZE)) {
                 btrfs_releasepage(page, GFP_NOFS);
                 return;
         }
@@ -8208,9 +8214,10 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
                 end = min(page_end,
                           ordered->file_offset + ordered->num_bytes - 1);
                 /*
-                * IO on this page will never be started, so we need to account
-                * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
-                * here, must leave that up for the ordered extent completion.
+                * IO on this ordered extent will never be started, so we need
+                * to account for any ordered extents now. Don't clear

So this comment update states that "IO on this ordered extent will
never be started".
That is not true, unless some other patch not in misc-next changed
something and I missed it (like splitting ordered extents).

If you have a 1M ordered extent, for file range [0, 1M[ for example,
and then truncate the file to 512K or punch a hole for the range
[512K, 1M[, then IO for the first 512K of the ordered extent is still
done.

So I think what you wanted to say is more like "IO for this subpage
will never be started ...".

This is indeed much better.


+                * EXTENT_DELALLOC_NEW here, must leave that up for the
+                * ordered extent completion.
                  */
                 if (!inode_evicting)
                         clear_extent_bit(tree, start, end,
@@ -8234,7 +8241,8 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
                                                            start,
                                                            end - start + 1, 
1)) {
                                 btrfs_finish_ordered_io(ordered);
-                               completed_ordered = true;
+                       } else {
+                               incompleted_ordered = true;
                         }
                 }

@@ -8276,7 +8284,7 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
                  * is cleared if we don't delete, otherwise it can lead to
                  * corruptions if the i_size is extented later.
                  */
-               if (found_ordered && !completed_ordered)
+               if (found_ordered && incompleted_ordered)

I find this naming, "incompleted_ordered" confusing, I think
"incompleted" is not even a valid english word.

What you mean is that if there is some ordered extent for the page
range that we could not complete ourselves.
I would suggest naming it to "completed_all_ordered", initialize it to
true and then set it to false when we can't complete an ordered extent
ourselves.

Then it would just be "if (found_ordered && !completed_all_ordered)
delete = false;".

Also, I haven't checked the other patchsets for subpage support, but
from looking only at this patchset, I'm assuming we can't set ranges
in the io tree smaller than page size, is that correct?
Otherwise this would be calling clear_extent_bit for each subpage range.

For current subpage implementation, we have to support sector aligned
writeback.

The requirement comes from data balance, we have to be able to write new
data extents exactly the same size as the originals.

Thus here we support range smaller than page size for subpage.
The extent io tree itself can support it without problem.

Thanks,
Qu

Thanks.

                         delete = false;
                 clear_extent_bit(tree, page_start, page_end, EXTENT_LOCKED |
                                  EXTENT_DELALLOC | EXTENT_UPTODATE |
@@ -8286,6 +8294,7 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
                 __btrfs_releasepage(page, GFP_NOFS);
         }

+       ClearPagePrivate2(page);
         ClearPageChecked(page);
         clear_page_extent_mapped(page);
  }
--
2.29.2



Reply via email to