Re: [PATCH v1] fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages

Dan Williams Thu, 10 Apr 2025 13:15:51 -0700

David Hildenbrand wrote:
> Alison reports an issue with fsdax when large extents end up using
> large ZONE_DEVICE folios:
> 
> [  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
> [  417.796982] #PF: supervisor read access in kernel mode
> [  417.797540] #PF: error_code(0x0000) - not-present page
> [  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
> [  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
> [  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
> [  417.800150] Tainted: [O]=OOT_MODULE
> [  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
> [  417.801948] Code: ...
> [  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
> [  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
> [  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
> [  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
> [  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
> [  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
> [  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
> [  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
> [  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  417.811353] Call Trace:
> [  417.811709]  <TASK>
> [  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
> [  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
> [  417.813132]  insert_page+0x78/0xf0
> [  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
> [  417.814088]  dax_fault_iter+0x484/0x7b0
> [  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
> [  417.815055]  dax_iomap_fault+0x39/0x40
> [  417.815499]  __xfs_write_fault+0x139/0x380
> [  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
> [  417.816483]  xfs_write_fault+0x41/0x50
> [  417.816966]  xfs_filemap_fault+0x3b/0xe0
> [  417.817424]  __do_fault+0x31/0x180
> [  417.817859]  __handle_mm_fault+0xee1/0x1a60
> [  417.818325]  ? debug_smp_processor_id+0x17/0x20
> [  417.818844]  handle_mm_fault+0xe1/0x2b0
> [...]
> 
> The issue is that when we split a large ZONE_DEVICE folio to order-0
> ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays
> page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
> has folio->memcg_data set. And we never manually initialize
> folio->memcg_data in fsdax code, because we never expect it to be set at
> all.
> 
> When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to
> use folio->memcg_data (because it's non-NULL) but it does not actually
> point at a memcg, resulting in the problem.
> 
> Alison also observed that these folios sometimes have "locked"
> set, which is rather concerning (folios locked from the beginning ...).
> The reason is that the order for large folios is stored in page[1]->flags,
> which become the folio->flags of a new small folio.
> 
> Let's fix it by adding a folio helper to clear order/_nr_pages for
> splitting purposes.
> 
> Maybe we should reinitialize other large folio flags / folio members as
> well when splitting, because they might similarly cause harm once
> page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should
> not be set for fsdax, so at least page[1]->flags might be as expected with
> this fix.
> 
> From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
> re-initialize most things from a previous page[1] used by large folios
> that fsdax cares about. For example folio->private might not get
> reinitialized, but maybe that's not relevant -- no traces of its use in
> fsdax code. Needs a closer look.
> 
> Another thing that should be considered in the future is performing similar
> checks as we perform in free_tail_page_prepare() -- checking pincount etc.
> -- when freeing a large fsdax folio.
> 
> Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
> Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
> Reported-by: Alison Schofield <alison.schofi...@intel.com>
> Closes: https://lkml.kernel.org/r/z_w9oeg-d9fhi...@aschofie-mobl2.lan
> Cc: Alexander Viro <v...@zeniv.linux.org.uk>
> Cc: Christian Brauner <brau...@kernel.org>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: Matthew Wilcox <wi...@infradead.org>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Alistair Popple <apop...@nvidia.com>
> Cc: Christoph Hellwig <h...@infradead.org>
> Signed-off-by: David Hildenbrand <da...@redhat.com>
> ---
>  fs/dax.c           |  1 +
>  include/linux/mm.h | 17 +++++++++++++++++
>  2 files changed, 18 insertions(+)


The explanation is excellent, and both folio_reset_order() and its
callsite in fsdax look correct to me.
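
A rough userspace picture of the overlay described in the quoted
explanation, assuming a heavily simplified page layout. The names
sketch_page, sketch_prep_large() and friends are made up for
illustration and are not the real struct page or any kernel API; the
sketch only shows why a stale _nr_pages in page[1] later reads as a
non-NULL memcg_data once page[1] is treated as a folio of its own.

```c
#include <assert.h>

/* Hypothetical, simplified page. The real struct page/struct folio
 * layout is far richer; only the overlay matters here. */
struct sketch_page {
	unsigned long flags;
	union {
		/* Meaning when this page is the head of its own folio. */
		unsigned long memcg_data;
		/* Meaning when this page is page[1] of a large folio.
		 * (Same type as memcg_data to keep the sketch simple.) */
		unsigned long nr_pages;
	};
};

/* Record the page count of a large folio in page[1]. */
static inline void sketch_prep_large(struct sketch_page *pages,
				     unsigned int order)
{
	pages[1].nr_pages = 1UL << order;
}

/* Split to order-0 without clearing the large-folio state. */
static inline void sketch_split_no_reset(struct sketch_page *pages)
{
	(void)pages;	/* nothing is reset: page[1] keeps its stale value */
}

/* Split to order-0 and clear the page count first. */
static inline void sketch_split_with_reset(struct sketch_page *pages)
{
	pages[1].nr_pages = 0;
}

/* What a memcg-aware path would conclude about a small folio. */
static inline int sketch_looks_charged(const struct sketch_page *page)
{
	return page->memcg_data != 0;
}
```

With sketch_split_no_reset(), page[1] still "looks charged" after the
split; with sketch_split_with_reset() it does not.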

Reviewed-by: Dan Williams <dan.j.willi...@intel.com>

For consistency and clarity, what about this incremental change to make
the __split_folio_to_order() path reuse folio_reset_order() and to use
the typical bitfield helpers for manipulating _flags_1?


diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf55206935c4..5b614d31f4f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -33,6 +33,7 @@
 #include <linux/slab.h>
 #include <linux/cacheinfo.h>
 #include <linux/rcuwait.h>
+#include <linux/bitfield.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -1171,7 +1172,7 @@ extern void prep_compound_page(struct page *page, unsigned int order);
 
 static inline unsigned int folio_large_order(const struct folio *folio)
 {
-       return folio->_flags_1 & 0xff;
+       return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1);
 }
 
 #ifdef NR_PAGES_IN_LARGE_FOLIO
@@ -1229,7 +1230,8 @@ static inline void folio_reset_order(struct folio *folio)
 {
        if (WARN_ON_ONCE(!folio_test_large(folio)))
                return;
-       folio->_flags_1 &= ~0xffUL;
+       ClearPageCompound(&folio->page);
+       folio->_flags_1 &= ~FOLIO_ORDER_MASK;
 #ifdef NR_PAGES_IN_LARGE_FOLIO
        folio->_nr_pages = 0;
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..3dc2d98fde24 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -483,6 +483,8 @@ struct folio {
        };
 };
 
+#define FOLIO_ORDER_MASK GENMASK(7, 0)
+
 #define FOLIO_MATCH(pg, fl)                                            \
        static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
 FOLIO_MATCH(flags, flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..301ca9459122 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
        if (new_order)
                folio_set_order(folio, new_order);
        else
-               ClearPageCompound(&folio->page);
+               folio_reset_order(folio);
 }
 
 /*
diff --git a/mm/internal.h b/mm/internal.h
index 50c2f590b2d0..41a4d2b66405 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -727,7 +727,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
        if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
                return;
 
-       folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
+       folio->_flags_1 &= ~FOLIO_ORDER_MASK;
+       folio->_flags_1 |= FIELD_PREP(FOLIO_ORDER_MASK, order);
 #ifdef NR_PAGES_IN_LARGE_FOLIO
        folio->_nr_pages = 1U << order;
 #endif
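
To sanity-check the mask arithmetic outside the kernel, the order
get/set/reset logic from the diff above can be mirrored in a small
userspace sketch. Everything prefixed sketch_/SKETCH_ is a simplified
stand-in written for this illustration: the kernel's real GENMASK(),
FIELD_GET() and FIELD_PREP() are macros with extra compile-time checks,
and the helpers here operate on a plain unsigned long rather than on
folio->_flags_1.

```c
#include <assert.h>

/* Simplified stand-in for GENMASK(h, l): bits l..h set. */
#define SKETCH_GENMASK(h, l) \
	((~0UL << (l)) & (~0UL >> (sizeof(unsigned long) * 8 - 1 - (h))))

#define SKETCH_FOLIO_ORDER_MASK SKETCH_GENMASK(7, 0)

/* Lowest set bit of the mask; dividing/multiplying by it shifts a
 * value out of/into the field without hardcoding the shift. */
static inline unsigned long sketch_lowbit(unsigned long mask)
{
	return mask & (~mask + 1UL);
}

/* Stand-in for FIELD_GET(): extract the field selected by mask. */
static inline unsigned long sketch_field_get(unsigned long mask,
					     unsigned long reg)
{
	return (reg & mask) / sketch_lowbit(mask);
}

/* Stand-in for FIELD_PREP(): position val inside the field. */
static inline unsigned long sketch_field_prep(unsigned long mask,
					      unsigned long val)
{
	return (val * sketch_lowbit(mask)) & mask;
}

/* Mirrors the folio_set_order() hunk: clear the field, then set it. */
static inline unsigned long sketch_set_order(unsigned long flags_1,
					     unsigned int order)
{
	flags_1 &= ~SKETCH_FOLIO_ORDER_MASK;
	flags_1 |= sketch_field_prep(SKETCH_FOLIO_ORDER_MASK, order);
	return flags_1;
}

/* Mirrors the mask clearing in folio_reset_order(). */
static inline unsigned long sketch_reset_order(unsigned long flags_1)
{
	return flags_1 & ~SKETCH_FOLIO_ORDER_MASK;
}

/* Mirrors the folio_large_order() hunk. */
static inline unsigned int sketch_large_order(unsigned long flags_1)
{
	return (unsigned int)sketch_field_get(SKETCH_FOLIO_ORDER_MASK,
					      flags_1);
}

/* Round trip: bits outside the order field must survive untouched. */
static inline void sketch_selfcheck(void)
{
	unsigned long f = sketch_set_order(0x100UL, 9);

	assert(sketch_large_order(f) == 9);
	assert(sketch_reset_order(f) == 0x100UL);
}
```

The point of the mask-based form is that the field width lives in one
definition instead of being repeated as 0xff at each site.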
