On 09.04.25 02:20, Alison Schofield wrote:
Hi David, because this bisected to a patch you posted
Hi Alistair, because vmf_insert_page_mkwrite() is in the path
Hi!
A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
need XFS and/or your Folio/tail page accounting knowledge to take it further.
A DAX XFS mapping that is SHARED and R/W fails when the folio is
unexpectedly NULL. Note that XFS PRIVATE always succeeds and XFS SHARED,
READ_ONLY works fine. Also note that it works in all of these modes with EXT4.
Huh, but why is the folio NULL?
insert_page_into_pte_locked() does "folio = page_folio(page)" and then
even calls folio_get(folio) before calling folio_add_file_rmap_pte().
folio_add_file_rmap_ptes()->__folio_add_file_rmap() just passes the
folio pointer along.
The RIP seems to be in __lruvec_stat_mod_folio(), so I assume we end up
in __folio_mod_stat()->__lruvec_stat_mod_folio().
There, we call folio_memcg(folio). Likely we're not getting NULL back
(which we could handle), but instead "0000000000000b00".
So maybe the memcg we get is "almost NULL", and not the folio?
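For reference, roughly what that function looks like (simplified from my
memory of mm/memcontrol.c, not verbatim), and where a bogus non-NULL memcg
would blow up:

void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
			     int val)
{
	pg_data_t *pgdat = folio_pgdat(folio);
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;

	rcu_read_lock();
	memcg = folio_memcg(folio);	/* derived from folio->memcg_data */

	/* A NULL memcg is handled: only the node counters get updated. */
	if (!memcg) {
		rcu_read_unlock();
		__mod_node_page_state(pgdat, idx, val);
		return;
	}

	/* ... but a non-NULL garbage pointer gets dereferenced in here. */
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_lruvec_state(lruvec, idx, val);
	rcu_read_unlock();
}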
[ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[ 417.796982] #PF: supervisor read access in kernel mode
[ 417.797540] #PF: error_code(0x0000) - not-present page
[ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G O
6.15.0-rc1-dirty #158 PREEMPT(voluntary)
[ 417.800150] Tainted: [O]=OOT_MODULE
[ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0
02/06/2015
[ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[ 417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a
01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00
09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
[ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000)
knlGS:0000000000000000
[ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 417.811353] Call Trace:
[ 417.811709] <TASK>
[ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
[ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
[ 417.813132] insert_page+0x78/0xf0
[ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
[ 417.814088] dax_fault_iter+0x484/0x7b0
[ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
[ 417.815055] dax_iomap_fault+0x39/0x40
[ 417.815499] __xfs_write_fault+0x139/0x380
[ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
[ 417.816483] xfs_write_fault+0x41/0x50
[ 417.816966] xfs_filemap_fault+0x3b/0xe0
[ 417.817424] __do_fault+0x31/0x180
[ 417.817859] __handle_mm_fault+0xee1/0x1a60
[ 417.818325] ? debug_smp_processor_id+0x17/0x20
[ 417.818844] handle_mm_fault+0xe1/0x2b0
[ 417.819286] do_user_addr_fault+0x217/0x630
[ 417.819747] ? rcu_is_watching+0x11/0x50
[ 417.820185] exc_page_fault+0x6c/0x210
[ 417.820599] asm_exc_page_fault+0x27/0x30
[ 417.821080] RIP: 0033:0x40130c
[ 417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48
8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00
00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
[ 417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
[ 417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
[ 417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
[ 417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
[ 417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
[ 417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
[ 417.827148] </TASK>
[ 417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O)
nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
[ 417.828404] CR2: 0000000000000b00
[ 417.828807] ---[ end trace 0000000000000000 ]---
[ 417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
And then, looking at the page passed to vmf_insert_page_mkwrite():
[ 55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)
reserved might indicate ZONE_DEVICE. But zone=3 might or might not be
ZONE_DEVICE (depending on the kernel config).
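(Roughly, from memory of enum zone_type in include/linux/mmzone.h; with
CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 both enabled, zone index 3 would be
ZONE_MOVABLE, but with e.g. CONFIG_ZONE_DMA disabled it would be
ZONE_DEVICE:)

enum zone_type {
#ifdef CONFIG_ZONE_DMA
	ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
	ZONE_DMA32,
#endif
	ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
	ZONE_HIGHMEM,
#endif
	ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
	ZONE_DEVICE,
#endif
	__MAX_NR_ZONES
};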
[ 55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff
ffff888033b69b88
[ 55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff
0000000000000200
[ 55.469835] page dumped because: ALISON dump locked & uptodate pages
Do you have the other (earlier) output from __dump_page(), especially
whether this page is part of a large folio etc.?
Trying to decipher:
0300000000002009 -> "unsigned long flags"
ffff888028c27b20 -> big union
As the big union overlays "unsigned long compound_head", and the last
bit is not set, this should be a *small folio*.
That would mean that "0000000000000200" would be "unsigned long memcg_data".
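Decoding the eight words of the raw dump against the usual 64-bit struct
page layout (from memory, so take with a grain of salt):

	word 0: 0300000000002009  flags
	word 1: ffff888028c27b20  lru.next / compound_head union (bit 0 clear)
	word 2: 00000000ffffffff  lru.prev
	word 3: ffff888033b69b88  mapping
	word 4: 000000000000fff5  index
	word 5: 0000000000000000  private
	word 6: 00000001ffffffff  _mapcount (-1) | _refcount (1)
	word 7: 0000000000000200  memcg_data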
0x200 (== 512, i.e., what a PMD-sized 2 MiB folio with 4 KiB pages would
have) might have been the folio_nr_pages before the large folio was split.
Likely, we are not clearing that when splitting the large folio, resulting
in a false-positive "memcg_data" after the split.
^ That's different: locked|uptodate. Other page flags arriving here are
not locked | uptodate.
Git bisect says this is the first bad patch (6.14 --> 6.15-rc1):
4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
Experimenting a bit with the patch: un-defining NR_PAGES_IN_LARGE_FOLIO
avoids the problem.
The way that patch reuses memory in tail pages, and the fact that it only
fails on XFS (not ext4), suggests that XFS depends on tail pages in a way
that ext4 does not.
IIRC, XFS supports large folios but ext4 does not. But I don't really
know how that interacts with DAX (whether the same applies there). Ordinary
XFS large folio tests seem to work just fine, so the question is what
DAX-specific thing is happening here.
When we free large folios back to the buddy, we set "folio->_nr_pages =
0", to make the "page->memcg_data" check in page_bad_reason() happy.
Also, just before splitting an ordinary large folio, we set
"folio->_nr_pages = 0".
Maybe there is something missing in the ZONE_DEVICE freeing/splitting code
for large folios, where we should do the same, to make sure that
page->memcg_data is actually 0 for all pages?
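Purely as an illustration of what I mean (the helper name and hook point
below are made up; only the clearing itself is the point), something along
these lines would be needed wherever the ZONE_DEVICE code frees or splits
a large folio:

/* Hypothetical helper -- name and call site invented for illustration. */
static inline void zone_device_folio_clear_nr_pages(struct folio *folio)
{
#ifdef NR_PAGES_IN_LARGE_FOLIO
	/* _nr_pages overlays memcg_data in the first tail page. */
	if (folio_test_large(folio))
		folio->_nr_pages = 0;
#endif
}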
I assume so. Let me dig.
--
Cheers,
David / dhildenb