On 09.04.25 02:20, Alison Schofield wrote:
Hi David, because this bisected to a patch you posted
Hi Alistair, because vmf_insert_page_mkwrite() is in the path

Hi!


A DAX unit test began failing on 6.15-rc1. I chased it as described below, but
need XFS and/or your Folio/tail page accounting knowledge to take it further.

A DAX XFS mapping that is SHARED and R/W fails when the folio is
unexpectedly NULL. Note that XFS PRIVATE always succeeds, and XFS SHARED,
READ_ONLY works fine. Also note that it works in all of these modes with EXT4.


Huh, but why is the folio NULL?

insert_page_into_pte_locked() does "folio = page_folio(page)" and then even calls folio_get(folio) before calling folio_add_file_rmap_pte().

folio_add_file_rmap_ptes()->__folio_add_file_rmap() just passes the folio pointer along.

The RIP seems to be in __lruvec_stat_mod_folio(), so I assume we end up in __folio_mod_stat()->__lruvec_stat_mod_folio().

There, we call folio_memcg(folio). Likely we're not getting NULL back, which we could handle, but a small bogus pointer instead, so the follow-up dereference faults at "0000000000000b00".

So maybe it is the memcg we get that is "almost NULL", and not the folio?

[  417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
[  417.796982] #PF: supervisor read access in kernel mode
[  417.797540] #PF: error_code(0x0000) - not-present page
[  417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
[  417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
[  417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: G           O        6.15.0-rc1-dirty #158 PREEMPT(voluntary)
[  417.800150] Tainted: [O]=OOT_MODULE
[  417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[  417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
[  417.801948] Code: 85 97 00 00 00 48 8b 43 38 48 89 c3 48 83 e3 f8 a8 02 0f 85 1a 01 00 00 48 85 db 0f 84 28 01 00 00 66 90 49 63 86 80 3e 00 00 <48> 8b 9c c3 00 09 00 00 48 83 c3 40 4c 3b b3 c0 00 00 00 0f 85 68
[  417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
[  417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[  417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
[  417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
[  417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
[  417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
[  417.807801] FS:  00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
[  417.808570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
[  417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  417.811353] Call Trace:
[  417.811709]  <TASK>
[  417.812038]  folio_add_file_rmap_ptes+0x143/0x230
[  417.812566]  insert_page_into_pte_locked+0x1ee/0x3c0
[  417.813132]  insert_page+0x78/0xf0
[  417.813558]  vmf_insert_page_mkwrite+0x55/0xa0
[  417.814088]  dax_fault_iter+0x484/0x7b0
[  417.814542]  dax_iomap_pte_fault+0x1ca/0x620
[  417.815055]  dax_iomap_fault+0x39/0x40
[  417.815499]  __xfs_write_fault+0x139/0x380
[  417.815995]  ? __handle_mm_fault+0x5e5/0x1a60
[  417.816483]  xfs_write_fault+0x41/0x50
[  417.816966]  xfs_filemap_fault+0x3b/0xe0
[  417.817424]  __do_fault+0x31/0x180
[  417.817859]  __handle_mm_fault+0xee1/0x1a60
[  417.818325]  ? debug_smp_processor_id+0x17/0x20
[  417.818844]  handle_mm_fault+0xe1/0x2b0
[  417.819286]  do_user_addr_fault+0x217/0x630
[  417.819747]  ? rcu_is_watching+0x11/0x50
[  417.820185]  exc_page_fault+0x6c/0x210
[  417.820599]  asm_exc_page_fault+0x27/0x30
[  417.821080] RIP: 0033:0x40130c
[  417.821461] Code: 89 7d d8 48 89 75 d0 e8 94 ff ff ff 48 c7 45 f8 00 00 00 00 48 8b 45 d8 48 89 45 f0 eb 18 48 8b 45 f0 48 8d 50 08 48 89 55 f0 <48> c7 00 01 00 00 00 48 83 45 f8 01 48 8b 45 d0 48 c1 e8 03 48 39
[  417.823156] RSP: 002b:00007ffcc82a8cb0 EFLAGS: 00010287
[  417.823703] RAX: 00007f336f5f5000 RBX: 00007ffcc82a8f08 RCX: 0000000067f5a1da
[  417.824382] RDX: 00007f336f5f5008 RSI: 0000000000000000 RDI: 0000000000036a98
[  417.825096] RBP: 00007ffcc82a8ce0 R08: 00007f339fa84000 R09: 00000000004040b0
[  417.825769] R10: 00007f339fa8a200 R11: 00007f339fa8a7b0 R12: 0000000000000000
[  417.826438] R13: 00007ffcc82a8f28 R14: 0000000000403e18 R15: 00007f339fac3000
[  417.827148]  </TASK>
[  417.827461] Modules linked in: nd_pmem(O) dax_pmem(O) nd_btt(O) nfit(O) nd_e820(O) libnvdimm(O) nfit_test_iomap(O)
[  417.828404] CR2: 0000000000000b00
[  417.828807] ---[ end trace 0000000000000000 ]---
[  417.829293] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250


And then, looking at the page passed to vmf_insert_page_mkwrite():

[   55.468109] flags: 0x300000000002009(locked|uptodate|reserved|node=0|zone=3)

reserved might indicate ZONE_DEVICE. But zone=3 might or might not be ZONE_DEVICE (depending on the kernel config).

[   55.468674] raw: 0300000000002009 ffff888028c27b20 00000000ffffffff ffff888033b69b88
[   55.469270] raw: 000000000000fff5 0000000000000000 00000001ffffffff 0000000000000200
[   55.469835] page dumped because: ALISON dump locked & uptodate pages

Do you have the other (earlier) output from __dump_page(), especially if this page is part of a large folio etc?

Trying to decipher:

0300000000002009 -> "unsigned long flags"
ffff888028c27b20 -> big union

As the big union overlays "unsigned long compound_head", and the last bit is not set, this should be a *small folio*.

That would mean that "0000000000000200" would be "unsigned long memcg_data".

0x200 might have been the folio_nr_pages before the large folio was split. Likely, we are not clearing that when splitting the large folio, resulting in a false-positive "memcg_data" after the split.


^ That's different:  locked|uptodate. Other page flags arriving here are
not locked | uptodate.

Git bisect says this is first bad patch (6.14 --> 6.15-rc1)
4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")

Experimenting a bit with the patch, un-defining NR_PAGES_IN_LARGE_FOLIO
avoids the problem.

The way that patch is reusing memory in tail pages and the fact that it
only fails in XFS (not ext4) suggest that XFS depends on tail pages
in a way that ext4 does not.

IIRC, XFS supports large folios but ext4 does not. But I don't really know how that interacts with DAX (if the same thing applies). Ordinary XFS large folio tests seem to work just fine, so the question is what DAX-specific thing is happening here.

When we free large folios back to the buddy, we set "folio->_nr_pages = 0", to make the "page->memcg_data" check in page_bad_reason() happy. Also, just before the large folio split for ordinary large folios, we set "folio->_nr_pages = 0".

Maybe there is something missing in ZONE_DEVICE freeing/splitting code of large folios, where we should do the same, to make sure that all page->memcg_data is actually 0?

I assume so. Let me dig.

--
Cheers,

David / dhildenb

