khugepaged: do synchronous writeback for MADV_COLLAPSE

Lorenzo Stoakes Mon, 10 Nov 2025 08:31:23 -0800

On Mon, Nov 10, 2025 at 01:22:16PM +0000, Lorenzo Stoakes wrote:
> On Mon, Nov 10, 2025 at 06:37:58PM +0530, Garg, Shivank wrote:
> >
> >
> > On 11/10/2025 5:31 PM, Lorenzo Stoakes wrote:
> > > On Mon, Nov 10, 2025 at 11:32:53AM +0000, Shivank Garg wrote:
> > >> When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
> >
> > >> ---
> > >> Applies cleanly on:
> > >> 6.18-rc5
> > >> mm-stable:e9a6fb0bc
> > >
> > > Please base on mm-unstable. mm-stable is usually out of date until very
> > > close to merge window.
> >
> > I'm observing issues when testing with kselftest on mm-unstable and mm-new
> > branches that prevent proper testing for my patches:
> >
> > On mm-unstable (without my patches):
> >
> > # # running ./transhuge-stress -d 20
> > # # --------------------------------
> > # # TAP version 13
> > # # 1..1
> > # # transhuge-stress: allocate 220271 transhuge pages, using 440543 MiB virtual memory and 1720 MiB of ram
> >
> >
> > [  367.225667] RIP: 0010:swap_cache_get_folio+0x2d/0xc0
> > [  367.230635] Code: 00 00 48 89 f9 49 89 f9 48 89 fe 48 c1 e1 06 49 c1 e9 
> > 3a 48 c1 e9 0f 48 c1 e1 05 4a 8b 04 cd c0 2e 5b 99 48 8b 78 60 48 01 cf 
> > <48> 8b 47 08 48 85 c0 74 20 48 89 f2 81 e2 ff 01 00 00 48 8d 04 d0
> > [  367.249378] RSP: 0000:ffffcde32943fba8 EFLAGS: 00010282
> > [  367.254605] RAX: ffff8bd1668fdc00 RBX: 00007ffc15df5000 RCX: 
> > 00003fffffffffe0
> > [  367.261736] RDX: ffffffff995cb530 RSI: 0003ffffffffffff RDI: 
> > ffffcbd1560dffe0
> > [  367.268862] RBP: 0003ffffffffffff R08: ffffcde32943fc47 R09: 
> > 0000000000000000
> > [  367.275994] R10: 0000000000000000 R11: 0000000000000000 R12: 
> > 0000000000000000
> > [  367.283129] R13: 0000000000000000 R14: ffff8bd1668fdc00 R15: 
> > 0000000000100cca
> > [  367.290260] FS:  00007ff600af5b80(0000) GS:ffff8c4e9ec7e000(0000) 
> > knlGS:0000000000000000
> > [  367.298344] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  367.304083] CR2: ffffcbd1560dffe8 CR3: 00000001280e9001 CR4: 
> > 0000000000770ef0
> > [  367.311216] PKRU: 55555554
> > [  367.313929] Call Trace:
> > [  367.316375]  <TASK>
> > [  367.318479]  __read_swap_cache_async+0x8e/0x1b0
> > [  367.323014]  swap_vma_readahead+0x23d/0x430
> > [  367.327198]  swapin_readahead+0xb0/0xc0
> > [  367.331039]  do_swap_page+0x5bc/0x1260
> > [  367.334789]  ? rseq_ip_fixup+0x6f/0x190
> > [  367.338631]  ? __pfx_default_wake_function+0x10/0x10
> > [  367.343596]  __handle_mm_fault+0x49a/0x760
> > [  367.347696]  handle_mm_fault+0x188/0x300
> > [  367.351620]  do_user_addr_fault+0x15b/0x6c0
> > [  367.355807]  exc_page_fault+0x60/0x100
> > [  367.359562]  asm_exc_page_fault+0x22/0x30
> > [  367.363574] RIP: 0033:0x7ff60091ba99
> > [  367.367153] Code: f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 40 c4 01 00 f3 
> > 0f 1e fa 80 3d b5 f5 0e 00 00 74 13 31 c0 0f 05 48 3d 00 f0 ff ff 77 4f 
> > <c3> 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75
> > [  367.385897] RSP: 002b:00007ffc15df1118 EFLAGS: 00010203
> > [  367.391124] RAX: 0000000000000001 RBX: 000055941fb672a0 RCX: 
> > 00007ff60091ba91
> > [  367.398256] RDX: 0000000000000001 RSI: 000055941fb813e0 RDI: 
> > 0000000000000000
> > [  367.405387] RBP: 00007ffc15df21e0 R08: 0000000000000000 R09: 
> > 0000000000000007
> > [  367.412513] R10: 000055941fb97cb0 R11: 0000000000000246 R12: 
> > 000055941fb813e0
> > [  367.419646] R13: 0000000000000000 R14: 0000000000000000 R15: 
> > 0000000000000000
> > [  367.426781]  </TASK>
> > [  367.428970] Modules linked in: xfrm_user xfrm_algo xt_addrtype 
> > xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat 
> > nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables 
> > overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl 
> > intel_rapl_msr intel_rapl_common wmi_bmof amd64_edac edac_mce_amd mgag200 
> > rapl drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper 
> > acpi_cpufreq i2c_piix4 ptdma k10temp i2c_smbus wmi acpi_power_meter ipmi_si 
> > acpi_ipmi ipmi_devintf ipmi_msghandler sg dm_multipath drm fuse dm_mod 
> > nfnetlink ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov 
> > async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 kvm_amd 
> > sd_mod ahci nvme libahci kvm libata nvme_core tg3 ccp megaraid_sas irqbypass
> > [  367.497528] CR2: ffffcbd1560dffe8
> > [  367.500846] ---[ end trace 0000000000000000 ]---
>
> Yikes, oopsies!
>
> I'll try running tests locally on threadripper, but ran tests against yours
> previously and seemed fine, strange. Maybe fixed since but let me try, maybe
> because swap is not enabled locally for me?
>
> Likely this actually...


I have tried on a swap-enabled setup and see no issue with mm-unstable.

So this is odd. I know you have limited time (_totally sympathise_), but if you
get a moment, would it be possible to bisect against tip mm-unstable/mm-new?

Obviously we want to make sure buggy swap code doesn't get merged to mainline!
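
For the bisect, a rough sketch of a helper that classifies a kernel log as
good or bad at each step. It only looks for the two signatures quoted in this
thread; the function names are hypothetical, and it assumes the kernel under
test has already been built, booted and put through the failing test, so the
result would be fed to `git bisect good` / `git bisect bad` by hand.

```python
import re

# Signatures copied from the logs quoted in this thread; the choice of these
# two patterns as the "bad" criterion is an assumption, not a confirmed rule.
BAD_PATTERNS = [
    re.compile(r"get_swap_device: Bad swap offset entry"),
    re.compile(r"RIP: 0010:swap_cache_get_folio\+"),
]


def count_hits(dmesg_text):
    """Return how many log lines match any of the known-bad signatures."""
    return sum(
        1
        for line in dmesg_text.splitlines()
        if any(p.search(line) for p in BAD_PATTERNS)
    )


def verdict(dmesg_text):
    """Map a kernel log to 'bad' or 'good' for the current bisect step."""
    return "bad" if count_hits(dmesg_text) > 0 else "good"
```

The text passed in would be the output of `dmesg` captured after the test run
(reading it may need root). A clean log only means the signatures did not show
up in that run, so a "good" verdict is worth confirming with a repeat run.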

>
> >
> >
> >
> > -----------------
> > On mm-new (without my patches):
> >
> > [  394.144770] get_swap_device: Bad swap offset entry 3ffffffffffff
> >
> > dmesg | grep "get_swap_device: Bad swap offset entry" | wc -l
> > 359
> >
> >
> > Additionally, kexec triggers an oops and crash during swapoff:
> >
> >
> >          Deactivating swap   704.854238] BUG: unable to handle page fault 
> > for address: ffffcbe2de8dffe8
> > [  704.861524] #PF: supervisor read access in kernel mode
> > ;39mswap.img.swa[  704.866666] #PF: error_code(0x0000) - not-present page
> > [  704.873253] PGD 0 P4D 0
> > p - /swap.im[  704.875790] Oops: Oops: 0000 [#1] SMP NOPTI
> > g...
> > [  704.881354] CPU: 122 UID: 0 PID: 107680 Comm: swapoff Kdump: loaded Not 
> > tainted 6.18.0-rc5+ #11 NONE
> > [  704.891283] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.16.2 
> > 07/09/2024
> > [  704.898930] RIP: 0010:swap_cache_get_folio+0x2d/0xc0
> > [  704.903907] Code: 00 00 48 89 f9 49 89 f9 48 89 fe 48 c1 e1 06 49 c1 e9 
> > 3a 48 c1 e9 0f 48 c1 e1 05 4a 8b 04 cd c0 2e 7b 95 48 8b 78 60 48 01 cf 
> > <48> 8b 47 08 48 85 c0 74 20 48 89 f2 81 e2 ff 01 00 00 48 8d 04 d0
> > [  704.922720] RSP: 0018:ffffcf1227b1fc08 EFLAGS: 00010282
> > [  704.928035] RAX: ffff8be2cefb3c00 RBX: 0000555c65a5c000 RCX: 
> > 00003fffffffffe0
> > [  704.928036] RDX: 0003ffffffffffff RSI: 0003ffffffffffff RDI: 
> > ffffcbe2de8dffe0
> > [  704.928037] RBP: 0000000000000000 R08: ffff8be2de8e0520 R09: 
> > 0000000000000000
> >          Unmount[  704.928038] R10: 000000000000ffff R11: ffffcf12236f4000 
> > R12: ffff8be2d5b8d968
> > [  704.928039] R13: 0003ffffffffffff R14: fffff3eec85eb000 R15: 
> > 0000555c65a51000
> > [  704.928039] FS:  00007f41fcab3800(0000) GS:ffff8c602b6fe000(0000) 
> > knlGS:0000000000000000
> > [  704.928040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  704.928041] CR2: ffffcbe2de8dffe8 CR3: 00000074981af004 CR4: 
> > 0000000000770ef0
> > [  704.928041] PKRU: 55555554
> > [  704.928042] Call Trace:
> > [  704.928043]  <TASK>
> > [  704.928044]  unuse_pte_range+0x10b/0x290
> > [  704.928047]  unuse_pud_range.isra.0+0x149/0x190
> > [  704.928048]  unuse_vma+0x1a6/0x220
> > [  704.928050]  unuse_mm+0x9b/0x110
> > [  704.928052]  try_to_unuse+0xc5/0x260
> > [  704.928053]  __do_sys_swapoff+0x244/0x670
> > ing boo[  705.016662]  do_syscall_64+0x67/0xc50
> > [  705.016667]  ? do_user_addr_fault+0x15b/0x6c0
> > t.mount - /b[  705.026100]  ? exc_page_fault+0x60/0x100
> > [  705.031498]  ? irqentry_exit_to_user_mode+0x20/0xe0
> > oot...
> > [  705.036377]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  705.042200] RIP: 0033:0x7f41fc9271bb
> > [  705.045780] Code: 0f 1e fa 48 83 fe 01 48 8b 15 59 bc 0d 00 19 c0 83 e0 
> > f0 83 c0 26 64 89 02 b8 ff ff ff ff c3 f3 0f 1e fa b8 a8 00 00 00 0f 05 
> > <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d bc 0d 00 f7 d8 64 89 01 48
> > [  705.064807] RSP: 002b:00007ffd14b5b6e8 EFLAGS: 00000202 ORIG_RAX: 
> > 00000000000000a8
> > [  705.064809] RAX: ffffffffffffffda RBX: 00007ffd14b5cf30 RCX: 
> > 00007f41fc9271bb
> > [  705.064810] RDX: 0000000000000001 RSI: 0000000000000c00 RDI: 
> > 000055d48f533a40
> > [  705.064810] RBP: 00007ffd14b5b750 R08: 00007f41fca03b20 R09: 
> > 0000000000000000
> > [  705.064811] R10: 0000000000000001 R11: 0000000000000202 R12: 
> > 0000000000000000
> > [  705.064811] R13: 0000000000000000 R14: 000055d4584f1479 R15: 
> > 000055d4584f2b20
> > [  705.064813]  </TASK>
> > [  705.064814] Modules linked in: xfrm_user xfrm_algo xt_addrtype 
> > xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat 
> > nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables 
> > overlay bridge stp llc cfg80211 rfkill binfmt_misc ipmi_ssif amd_atl 
> > intel_rapl_msr intel_rapl_common wmi_bmof amd64_edac edac_mce_amd rapl 
> > mgag200 drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper 
> > acpi_cpufreq i2c_piix4 ptdma ipmi_si k10temp i2c_smbus acpi_power_meter wmi 
> > acpi_ipmi ipmi_msghandler sg dm_multipath fuse drm dm_mod nfnetlink ext4 
> > crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq 
> > async_xor xor async_tx raid6_pq raid1 raid0 sd_mod kvm_amd ahci libahci kvm 
> > nvme tg3 libata ccp irqbypass nvme_core megaraid_sas [last unloaded: 
> > ipmi_devintf]
> > [  705.180420] CR2: ffffcbe2de8dffe8
> > [  705.183852] ---[ end trace 0000000000000000 ]---
> >
> >
> > I haven't had cycles to dig into this yet and been swamped with other 
> > things.
>
> Fully understand, I'm _very_ familiar with this situation :)
>
> I need more cores... ;)

Oh it's nice to have more :) I am bankrupt now, but it's nice to have more ;)

Cheers, Lorenzo

Re: [PATCH 1/2] mm/khugepaged: do synchronous writeback for MADV_COLLAPSE
