David Hildenbrand wrote: > Alison reports an issue with fsdax when large extends end up using > large ZONE_DEVICE folios: > > [ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00 > [ 417.796982] #PF: supervisor read access in kernel mode > [ 417.797540] #PF: error_code(0x0000) - not-present page > [ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0 > [ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI > [ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ... > [ 417.800150] Tainted: [O]=OOT_MODULE > [ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 > 02/06/2015 > [ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250 > [ 417.801948] Code: ... > [ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206 > [ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: > 0000000000000002 > [ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: > ffffffff82a2beae > [ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: > 0000000000000000 > [ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: > 0000000000000001 > [ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: > ffff888029210580 > [ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) > knlGS:0000000000000000 > [ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: > 0000000000370ef0 > [ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [ 417.811353] Call Trace: > [ 417.811709] <TASK> > [ 417.812038] folio_add_file_rmap_ptes+0x143/0x230 > [ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0 > [ 417.813132] insert_page+0x78/0xf0 > [ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0 > [ 417.814088] dax_fault_iter+0x484/0x7b0 > [ 417.814542] dax_iomap_pte_fault+0x1ca/0x620 > [ 417.815055] dax_iomap_fault+0x39/0x40 > [ 417.815499] __xfs_write_fault+0x139/0x380 > [ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60 > [ 417.816483] xfs_write_fault+0x41/0x50 > [ 417.816966] xfs_filemap_fault+0x3b/0xe0 > [ 417.817424] __do_fault+0x31/0x180 > [ 417.817859] __handle_mm_fault+0xee1/0x1a60 > [ 417.818325] ? debug_smp_processor_id+0x17/0x20 > [ 417.818844] handle_mm_fault+0xe1/0x2b0 > [...] > > The issue is that when we split a large ZONE_DEVICE folio to order-0 > ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays > page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it > has folio->memcg_data set. And we never manually initialize > folio->memcg_data in fsdax code, because we never expect it to be set at > all. > > When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to > use folio->memcg_data (because it's non-NULL) but it does not actually > point at a memcg, resulting in the problem. > > Alison also observed that these folios sometimes have "locked" > set, which is rather concerning (folios locked from the beginning ...). > The reason is that the order for large folios is stored in page[1]->flags, > which become the folio->flags of a new small folio. > > Let's fix it by adding a folio helper to clear order/_nr_pages for > splitting purposes. > > Maybe we should reinitialize other large folio flags / folio members as > well when splitting, because they might similarly cause harm once > page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should > not be set for fsdax, so at least page[1]->flags might be as expected with > this fix. > > From a quick glimpse, initializing ->mapping, ->pgmap and ->share should > re-initialize most things from a previous page[1] used by large folios > that fsdax cares about. For example folio->private might not get > reinitialized, but maybe that's not relevant -- no traces of it's use in > fsdax code. Needs a closer look. > > Another thing that should be considered in the future is performing similar > checks as we perform in free_tail_page_prepare() -- checking pincount etc. > -- when freeing a large fsdax folio. > > Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first > tail page") > Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") > Reported-by: Alison Schofield <alison.schofi...@intel.com> > Closes: https://lkml.kernel.org/r/z_w9oeg-d9fhi...@aschofie-mobl2.lan > Cc: Alexander Viro <v...@zeniv.linux.org.uk> > Cc: Christian Brauner <brau...@kernel.org> > Cc: Jan Kara <j...@suse.cz> > Cc: Dan Williams <dan.j.willi...@intel.com> > Cc: Matthew Wilcox <wi...@infradead.org> > Cc: Andrew Morton <a...@linux-foundation.org> > Cc: Alistair Popple <apop...@nvidia.com> > Cc: Christoph Hellwig <h...@infradead.org> > Signed-off-by: David Hildenbrand <da...@redhat.com> > --- > fs/dax.c | 1 + > include/linux/mm.h | 17 +++++++++++++++++ > 2 files changed, 18 insertions(+)
Explanation excellent, folio_reset_order() looks correct to me and the callsite in fsdax looks correct. Reviewed-by: Dan Williams <dan.j.willi...@intel.com> For consistency and clarity what about this incremental change, to make the __split_folio_to_order() path reuse folio_reset_order(), and use typical bitfield helpers for manipulating _flags_1? diff --git a/include/linux/mm.h b/include/linux/mm.h index bf55206935c4..5b614d31f4f6 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -33,6 +33,7 @@ #include <linux/slab.h> #include <linux/cacheinfo.h> #include <linux/rcuwait.h> +#include <linux/bitfield.h> struct mempolicy; struct anon_vma; @@ -1171,7 +1172,7 @@ extern void prep_compound_page(struct page *page, unsigned int order); static inline unsigned int folio_large_order(const struct folio *folio) { - return folio->_flags_1 & 0xff; + return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1); } #ifdef NR_PAGES_IN_LARGE_FOLIO @@ -1229,7 +1230,8 @@ static inline void folio_reset_order(struct folio *folio) { if (WARN_ON_ONCE(!folio_test_large(folio))) return; - folio->_flags_1 &= ~0xffUL; + ClearPageCompound(&folio->page); + folio->_flags_1 &= ~FOLIO_ORDER_MASK; #ifdef NR_PAGES_IN_LARGE_FOLIO folio->_nr_pages = 0; #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 56d07edd01f9..3dc2d98fde24 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -483,6 +483,8 @@ struct folio { }; }; +#define FOLIO_ORDER_MASK GENMASK(7, 0) + #define FOLIO_MATCH(pg, fl) \ static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl)) FOLIO_MATCH(flags, flags); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2a47682d1ab7..301ca9459122 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order, if (new_order) folio_set_order(folio, new_order); else - ClearPageCompound(&folio->page); + folio_reset_order(folio); } /* diff --git a/mm/internal.h b/mm/internal.h index 50c2f590b2d0..41a4d2b66405 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -727,7 +727,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) if (WARN_ON_ONCE(!order || !folio_test_large(folio))) return; - folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; + folio->_flags_1 &= ~FOLIO_ORDER_MASK; + folio->_flags_1 |= FIELD_PREP(FOLIO_ORDER_MASK, order); #ifdef NR_PAGES_IN_LARGE_FOLIO folio->_nr_pages = 1U << order; #endif