MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages are dirty. This affects scenarios like package/container updates or executing binaries immediately after writing them, etc.
The issue is that collapse_file() triggers async writeback and returns SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But MADV_COLLAPSE is synchronous and userspace expects immediate success or a clear retry signal. Reproduction: - Compile or copy 2MB-aligned executable to XFS/ext4 FS - Call MADV_COLLAPSE on .text section - First call fails with -EINVAL (text pages dirty from copy) - Second call succeeds (async writeback completed) Issue Report: https://lore.kernel.org/all/[email protected] Changelog: V3: - Reordered patches: Enum definition comes first as the retry logic depends on it - Renamed SCAN_PAGE_NOT_CLEAN to SCAN_PAGE_DIRTY_OR_WRITEBACK (Dev, Lance, David) - Changed writeback logic: Only trigger synchronous writeback and retry if the initial collapse attempt failed specifically due to dirty/writeback pages, rather than blindly flushing all file-backed VMAs (David) - Added proper file reference counting (get_file/fput) around the unlock window to prevent UAF (Lance) V2: - https://lore.kernel.org/all/[email protected] - Move writeback to madvise_collapse() (better abstraction, proper mmap_lock handling and does VMA revalidation after I/O) (Lorenzo) - Rename to SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use for all dirty/writeback folio cases that previously returned incorrect results (Dev) V1: https://lore.kernel.org/all/[email protected] Thanks, Shivank Garg (2): mm/khugepaged: map dirty/writeback pages failures to EAGAIN mm/khugepaged: retry with sync writeback for MADV_COLLAPSE include/trace/events/huge_memory.h | 3 +- mm/khugepaged.c | 49 ++++++++++++++++++++++++++++-- 2 files changed, 48 insertions(+), 4 deletions(-) base-commit: 2178727587e1eaa930b8266377119ed6043067df -- 2.43.0
