Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

Qu Wenruo Mon, 19 Apr 2021 00:20:22 -0700



On 2021/4/19 下午2:16, Qu Wenruo wrote:



On 2021/4/19 下午1:59, riteshh wrote:

On 21/04/16 10:22PM, riteshh wrote:
On 21/04/16 02:14PM, Qu Wenruo wrote:
On 2021/4/16 下午1:50, riteshh wrote:
On 21/04/16 09:34AM, Qu Wenruo wrote:
On 2021/4/16 上午7:34, Qu Wenruo wrote:
On 2021/4/16 上午7:19, Qu Wenruo wrote:
On 2021/4/15 下午10:52, riteshh wrote:
On 21/04/15 09:14AM, riteshh wrote:
On 21/04/12 07:33PM, Qu Wenruo wrote:
Good news, you can fetch the subpage branch for better testresults.
Now the branch should pass all generic tests, except defragandknown
failures.
And no more random crash during the tests.
Thanks, let me test it on PPC64 box.
I do see some failures remaining with the patch series.
However the one which is blocking my testing is thetests/generic/095
I see kernel BUG hitting with below signature.
That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.
Please let me know if this a known failure?

<xfstests config>
#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION       -- btrfs_4k
FSTYP         -- btrfs
PLATFORM -- Linux/ppc64le qemu5.12.0-rc7-02316-g3490dae50c0 #73
SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3
I see you're using -n 4096, not the default -n 16K, let me seeif I can
reproduce that.

But from the backtrace, it doesn't look like the case,
as it happens for data path, which means it's only related tosectorsize.
MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch


<kernel logs>
[ 6057.560580] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching isenabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector
size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid43ec9cdf-c124-4460-ad93-933bfd5ddbbd
devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching isenabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) &&page->private,
in fs/btrfs/subpage.c:171
[ 6061.378296] ------------[ cut here ]------------
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c0000000260d7490]
       pc: c000000000a9370c: assertfail.constprop.11+0x34/0x48
       lr: c000000000a93708: assertfail.constprop.11+0x30/0x48
       sp: c0000000260d7730
      msr: 800000000282b033
     current = 0xc0000000260c0080
paca = 0xc00000003fff8a00 irqmask: 0x03irq_happened: 0x01
       pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@xxxx) (gcc
(Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils forUbuntu)
2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c0000000260d7790] c000000000a90280
btrfs_subpage_assert.isra.9+0x70/0x110
[c0000000260d77b0] c000000000a91064
btrfs_subpage_set_uptodate+0x54/0x110
[c0000000260d7800] c0000000009c6d0c btrfs_dirty_pages+0x1bc/0x2c0
This is very strange.
As in btrfs_dirty_pages(), the pages passed in are alreadyprepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliable?
Yes. almost reliably on my PPC box.
OK, I got it reproduced.

It's not a reliable BUG_ON(), but can be reproduced.
The test get skipped for all my boards as it requires fio tool,thus I
didn't get it triggered for all previous runs.

I'll take a look into the case.
This exposed an interesting race window in btrfs_buffered_write():
          Writer                    |             fadvice
----------------------------------+-------------------------------
btrfs_buffered_write()            |
|- prepare_pages()                |
|  |- Now all pages involved get  |
|     Private set                 |
|                                 | btrfs_release_page()
|                                 | |- Clear page Private
|- lock_extent()                  |
|  |- This would prevent          |
|     btrfs_release_page() to     |
|     clear the page Private      |
|
|- btrfs_dirty_page()
     |- Will trigger the BUG_ON()
Sorry about the silly query. But help me understand how is aboveracepossible?Won't prepare_pages() will lock all the pages first. The samerequirement
of locked page should be with btrfs_releasepage() too no?
releasepage() call can easily got a page locked and release it.

For call sites like btrfs_invalidatepage(), the page is already locked.
btrfs_releasepage() will not to try to release the page if theextent is
locked (any extent range inside the page has EXTENT_LOCK bit).
I see only two paths which could result into btrfs_releasepage()
1. one via try_to_release_pages -> releasepage()
This is the race one, called from fadvice() to release pages.
2. writeback path calling btrfs_writepage or btrfs_writepages
    which may result into calling of btrfs_invalidatepage()
Not this one.
Although I am not sure which one this is racing with.
This only happens for subpage, because subpage introduces newASSERT()
to do extra check.

If we want to speak strictly, regular sector size should also report
this problem.
But regular sector size case doesn't really care about pagePrivate,asit just set page->private to a constant value, unlike subpage casewhich
stores important value.

The fix will just re-set page Private and needed structures in
btrfs_dirty_page(), under extent locked so no btrfs_releasepage() is
able to release it anymore.
With above fix I see a different issue with below signature.
[ 130.272410] BTRFS warning (device loop2): read-write for sectorsize 4096 with page size 65536 is experimental
[  130.387470] run fstests generic/095 at 2021-04-16 05:04:09
[ 132.042532] BTRFS: device fsid642daee0-165a-4271-b6f3-728f215c5348 devid 1 transid 5 /dev/loop3scanned by mkfs.btrfs (5226)[ 132.146892] BTRFS info (device loop3): disk space caching isenabled
[  132.147831] BTRFS info (device loop3): has skinny extents
[ 132.148491] BTRFS warning (device loop3): read-write for sectorsize 4096 with page size 65536 is experimental
[  132.158228] BTRFS info (device loop3): checking UUID tree
[  133.931695] BUG: spinlock bad magic on CPU#4, swapper/4/0
[ 133.932874] BUG: Unable to handle kernel data access on write at0x6b6b6b6b6b6b725b
That looks like some poisoned memory.
I have run 128 runs of generic/095 locally on my Arm board duringthe fix,
unable to reproduce the crash anymore.
And this call site is even harder to get race, as in endio context,the page
still has PageWriteback until the last bio finished in the page.
This means btrfs_releasepage() will not even try to release thepage, whilebtrfs_invalidatepage() will wait the page to finish its writebackbefore
doing anything.

So this is very strange to me.
Any reproducibility on your side? Or something specific to Power isrelatedto this case? (IIRC some page flag operation is not atomic, maybethatis
related?)
I doubt if this is Power related. And yes, I can reproduce the issuefairlyeasily. For now I will exclude the test from my run to get a overallrun with
Here, are some other failures that I noticed during testing on Power.
Thanks for looking into this.


Thank you very much for the extra test!

1. tests/btrfs/052
btrfs/052 [failed, exit status 1]- output mismatch (see/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad)
     --- tests/btrfs/052.out     2020-08-04 09:59:08.328299552 +0000
+++/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad 2021-04-1617:18:17.762928432 +0000
     @@ -91,553 +91,5 @@
      23 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
      *
      30
     -0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
     -*
     -2 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
     -*
     ...
(Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/052.out/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/052.out.bad'to see the entire diff)
^^^ this could also be due to below error found in 052.full
ERROR: defrag range ioctl not supported in this kernel version,2.6.33 and newer is required
    total 1 failures
failed: '/usr/local/bin/btrfs filesystem defragment/mnt1/scratch/foo'
2. tests/btrfs/076 => looks a genuine failure.
btrfs/076 - output mismatch (see/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad)
     --- tests/btrfs/076.out     2020-08-04 09:59:08.338299786 +0000
+++/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad 2021-04-1617:19:33.344981383 +0000
     @@ -1,3 +1,3 @@
      QA output created by 076
     -80
     -80
     +1
     +1
     ...
(Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/076.out/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/076.out.bad'to see the entire diff)


This is really a compression related one. Since I hardcoded to disable
compression, the ratio is always be 1.

3. tests/btrfs/106  => looks a genuine failure.
btrfs/106 - output mismatch (see/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad)
     --- tests/btrfs/106.out     2020-08-04 09:59:08.348300020 +0000
+++/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad 2021-04-1617:49:27.296128823 +0000
     @@ -5,19 +5,19 @@
      File contents before unmount:
      0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
      *
     -40
     +1000
      File contents after remount:
      0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
     ...
(Run 'diff -u /home/qemu/work-tools/xfstests/tests/btrfs/106.out/home/qemu/work-tools/xfstests/results//btrfs_4k/btrfs/106.out.bad'to see the entire diff)


That's a similar problem, compression needed
while compression is hard coded to be disable, thus clone reports
different value.

these patches. Later will try and debug what is going on.

But if you need any debug logs - do let me know, as it is fairly easily
reproducible.

For tests/generic/095 can you pls retry reproducing the issue (withyourlatest

patch) on your setup with below configs enabled?
1. CONFIG_PAGE_OWNER, CONFIG_PAGE_POISONING, CONFIG_SLUB_DEBUG_ON,

CONFIG_SCHED_STACK_END_CHECK, CONFIG_DEBUG_VM,CONFIG_DEBUG_STACKOVERFLOW,

    CONFIG_DEBUG_VM_PGFLAGS, CONFIG_DEBUG_SPINLOCK, CONFIG_PROVE_LOCKING


Thanks, I'll retry using the extra debugging options.

But I have a more solid explanation on why the bug happens now.

You're right, prepare_pages() should have the page locked by calling
find_or_create_page(), so btrfs_releasepage() shouldn't sneak in and
just release the page.

But there is a small window in prepare_uptodate_page(), where we may
call btrfs_readpage(), which will unlock the page.

So there is a window where we have page unlocked, before we re-lock it
in prepare_uptodate_page().

By that, we got a page with its Private bit cleared.

I'm trying a better fix like the following diff.
But I'm not yet 100% confident if the PagePrivate() check is enough,
thus I'll do more test before sending the proper fix.

Thanks,
Qu

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 45ec3f5ef839..49f78d643392 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c

@@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode*inode,

                        unlock_page(page);
                        return -EIO;
                }
-               if (page->mapping != inode->i_mapping) {
+
+               /*
+                * Since btrfs_readpage() will get the page unlocked, we
have
+                * a window where fadvice() can try to release the page.
+                * Here we check both inode mapping and PagePrivate() to
+                * make sure the page is not released.
+                *
+                * The priavte flag check is essential for subpage as we
need
+                * to store extra bitmap using page->private.
+                */
+               if (page->mapping != inode->i_mapping ||
PagePrivate(page)) {

  ^ Obviously it should be !PagePrivate(page).

Thanks,
Qu

                        unlock_page(page);
                        return -EAGAIN;
                }



-ritesh

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

Reply via email to