On 2021/4/2 上午9:39, Anand Jain wrote:
On 29/03/2021 10:01, Qu Wenruo wrote:

On 2021/3/29 上午4:02, Ritesh Harjani wrote:
On 21/03/25 09:16PM, Qu Wenruo wrote:

On 2021/3/25 下午8:20, Neal Gompa wrote:
On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo <w...@suse.com> wrote:

This patchset can be fetched from the following github repo, along
the full subpage RW support:

This patchset is for metadata read write support.

Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
    Since current defrag is doing per-page defrag, to support subpage
    defrag, we need some change in the loop.
    E.g. if a page has both hole and regular extents in it, then
    will rewrite the full 64K page.

    Thus for now, defrag related failure is expected.
    But this should only cause behavior difference, no crash nor

- No compression support yet
    There are at least 2 known bugs if forcing compression for
    * Some hard coded PAGE_SIZE screwing up space rsv
    * Subpage ASSERT() triggered
      This is because some compression code is unlocking
locked_page by
      calling extent_clear_unlock_delalloc() with locked_page ==
    So for now compression is also disabled.

- Inode nbytes mismatch
    Still debugging.
    The fastest way to trigger is fsx using the following parameters:

      fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file >

    Which would cause inode nbytes differs from expected value and
    triggers btrfs check error.

The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size

- No more page locking
    Now metadata read/write relies on extent io tree locking,
other than
    page locking.
    This is to allow behaviors like read lock one eb while also
try to
    read lock another eb in the same page.
    We can't rely on page lock as now we have multiple extent
    the same page.

- Page status update
    Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
    Instead of just grabbing extent buffer from page::private, we
need to
    iterate all dirty extent buffers in the page and submit them.

- Rebased to latest misc-next
    No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
    This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
    No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
    Patch 2~3, originally in RW part.

- Fix one uninitialized variable
    Patch 6.

- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
    This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
    btrfs: add sysfs interface for supported sectorsize
    btrfs: use min() to replace open-code in btrfs_invalidatepage()
    btrfs: remove unnecessary variable shadowing in
    btrfs: refactor how we iterate ordered extent in
    btrfs: introduce helpers for subpage dirty status
    btrfs: introduce helpers for subpage writeback status
    btrfs: allow btree_set_page_dirty() to do more sanity check on
    btrfs: support subpage metadata csum calculation at write time
    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
    btrfs: make the page uptodate assert to be subpage compatible
    btrfs: make set/clear_extent_buffer_dirty() to be subpage
    btrfs: make set_btree_ioerr() accept extent buffer and to be
    btrfs: add subpage overview comments

   fs/btrfs/disk-io.c   | 143
   fs/btrfs/extent_io.c | 127 ++++++++++++++++++++++++++++----------
   fs/btrfs/inode.c     | 128 ++++++++++++++++++++++----------------
   fs/btrfs/subpage.c   | 127 ++++++++++++++++++++++++++++++++++++++
   fs/btrfs/subpage.h   |  17 +++++
   fs/btrfs/sysfs.c     |  15 +++++
   6 files changed, 441 insertions(+), 116 deletions(-)


Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.

The metadata part is much more stable than data path (almost not
for several months), and the metadata part already has some difference
in its behavior, which needs review.

You point makes some sense, but I still don't believe pushing a super
large patchset does any help for the review.

If you want to test, you can grab the branch from the github repo.
If you want to review, the mails are all here for review.

In fact, we used to have subpage support sent as a big patchset from
guys, but the result is only some preparation patches get merged, and
nothing more.

Using this multi-series method, we're already doing better work and
received more testing (to ensure regular sectorsize is not affected at

Hi Qu Wenruo,

Sorry about chiming in late on this. I don't have any strong
objection on either
approach. Although sometime back when I tested your RW support git
tree on
Power, the unmount patch itself was crashing. I didn't debug it that
(this was a month back or so), so I also didn't bother testing
xfstests on Power.

But we do have an interest in making sure this patch series work on
bs <ps
on Power platform. I can try helping with testing, reviewing (to best
knowledge) and fixing anything is possible :)

That's great!

One of my biggest problem here is, I don't have good enough testing

Although SUSE has internal clouds for ARM64/PPC64, but due to the
f**king Great Firewall, it's super slow to access, no to mention doing
proper debugging.

Currently I'm using two ARM SBCs, RK3399 and A311D based, to do the test.
But their computing power is far from ideal, only generic/quick can
finish in hours.

Thus real world Power could definitely help.

Let me try and pull your tree and test it on Power. Please let me
know if there
is anything needs to be taken care apart from your github tree and
branch with bs < ps support.

If you're going to test the branch, here are some small notes:

- Need to use latest btrfs-progs
  As it fixes a false alert on crossing 64K page boundary.

- Need to slightly modify btrfs-progs to avoid false alerts
  For subpage case, mkfs.btrfs will output a warning, but that warning
  is outputted into stderr, which will screw up generic test groups.
  It's recommended to apply the following diff:

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9..21976554 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -341,8 +341,8 @@ int btrfs_check_sectorsize(u32 sectorsize)
                return -EINVAL;
        if (page_size != sectorsize)
-               warning(
-"the filesystem may not be mountable, sectorsize %u doesn't match page
size %u",
+               printf(
+"the filesystem may not be mountable, sectorsize %u doesn't match page
size %u\n",
                        sectorsize, page_size);
        return 0;

- Xfstest/btrfs group will crash at btrfs/143
  Still investigating, but you can ignore btrfs group for now.

- Very rare hang
  There is a very low change to hang, with "bad ordered accounting"
  If you can hit, please let me know.
  I had something idea to fix it, but not yet in the branch.

- btrfs inode nbytes mismatch
  Investigating, as it will make btrfs-check to report error.

The last two bugs are the final show blocker, I'll give you extra
updates when those are fixed.

  I am running the tests on aarch64 here. Are fixes for these known
  issues posted in the ML? I can't see them yet.

Not yet, even in my subpage branch.

The problem here is completely in btrfs_invalidatepate() race against
writepage endio.

The current problem is we're using page Private2 bit to indicate if
there is any pending ordered io to be finished.

But for subpage case, just single bit in page Private2 is no longer

The following case can happen:

        T1                      |               T2
Page [0, 16K) dirtied           |
Page [0, 16K) delalloc start    |
|- New ordered extent created   |
|- With PagePrivate2 set        |
[0, 16K) write page endio       |
|- Clear PagePrivate2           |
|- OE [0, 16K) IO_DONE          |
|- Queue finish_ordered_io()    |
   But OE [0, 16K) still in tree|
Page [16K, 32K) dirtied         |
Page [0, 16K) delalloc start    |
|- New ordered extent created   |
|- With PagePrivate2 set        |
                                | invalidatepage on [0, 64K)
                                | |- TestClearPagePrivate2
                                | |- Dec OE on range [0, 16k)
                                | |  |- Underflow OE [0, 16K) <<<
                                | |- Dec OE on range [16K, 32K)
                                |    |- This is proper dec

In above case, in invalidatepage(), Ordered Extent [0, 16K) should not
get decreased, as its endio has finished.

Normally we rely on PagePrivate2 to prevent such problem, but for
current subpage case it doesn't have bitmap for it, and causes the problem.

The invalidatepage() part is also responsible for the inode nbytes mismatch.

IMHO, the btrfs_dec_test_.*ordered_pending() API is pure garbage.
It require callers to handle the Private2 bit and do the loop, but it
should be completely integrated into ordered extent code, not exposing
those details for callers.

I'm currently reworking the involved APIs,
btrfs_dec_test_first_ordered_pending() has been converted to subpage
friendly one and passes tests for 4K page systems.

But the btrfs_dec_test_ordered_pending() in btrfs_invalidatepage() is a
much harder hassle to handle.
Will keep working on the problem in recent days to completely solve it,
then rebase all the subpage code on the refactor ordered extent code.


Thanks, Anand



Reply via email to