On 6/22/26 17:55, Vlastimil Babka (SUSE) wrote:
> On 6/18/26 13:43, Wandun wrote:
>>
>>
>> On 6/18/26 02:52, Vlastimil Babka (SUSE) wrote:
>>> On 6/4/26 04:38, Wandun Chen wrote:
>>>> From: Wandun Chen <[email protected]>
>>>>
>>>> compact_unevictable_allowed defaults to 0 under PREEMPT_RT, so
>>>> isolate_migratepages_block() skips folios with PG_unevictable set.
>>>> However, mlock_folio() sets PG_mlocked immediately but defers
>>>> PG_unevictable to mlock_folio_batch(), resulting in a folio with
>>>> PG_mlocked=1 but PG_unevictable=0. Compaction will isolate such a
>>>> folio.
>>>>
>>>> Fix by checking folio_test_mlocked() together with the existing
>>>> folio_test_unevictable() check.
>>>>
>>>> A similar issue has been reported by Alexander Krabler on a 6.12-rt
>>>> aarch64 system. Vlastimil suggested checking the mlocked flag [1].
>>>>
>>>> Reported-by: Alexander Krabler <[email protected]>
>>>> Closes:
>>>> https://lore.kernel.org/all/du0pr01mb10385345f7153f3341009818882...@du0pr01mb10385.eurprd01.prod.exchangelabs.com/
>>>> Suggested-by: Vlastimil Babka <[email protected]>
>>>> Signed-off-by: Wandun Chen <[email protected]>
>>>> Link:
>>>> https://lore.kernel.org/all/[email protected]/
>>>> [1]
>>>
>>> Well, in that thread Hugh doubted my suggestion, and then it seems we
>>> didn't conclude anything. Did you actually observe in practice the issue
>>> that Alexander had, and that this patch fixed it, or is that theoretical?
>>>
>> Yes, I wrote a test case that can reproduce it within a few seconds.
>>
>> The test case contains 3 steps:
>> 1. mlockall
>> 2. mmap file(2GB) + trigger file write page fault;
>> 3. during step 2, trigger compaction via /proc/sys/vm/compact_memory
>>
>>
>> My reproduction environment is QEMU with 4GB RAM, 8 cores, aarch64,
>> PREEMPT_RT, and it includes the tracepoint from patch 02.
>> After running the reproduction program for a few seconds, the
>> following output appears.
>
> Ah, nice.
>
>> repro-403 [004] ....1 101.270505: mm_compaction_isolate_folio:
>> pfn=0x71e3a mode=0x0 flags=referenced|uptodate|mlocked
>> repro-403 [004] ....1 101.270507: mm_compaction_isolate_folio:
>> pfn=0x71e3b mode=0x0 flags=referenced|uptodate|mlocked
>> repro-403 [004] ....1 101.270513: mm_compaction_isolate_folio:
>> pfn=0x71e3c mode=0x0 flags=referenced|uptodate|mlocked
>> repro-403 [004] ....1 101.270515: mm_compaction_isolate_folio:
>> pfn=0x71e3d mode=0x0 flags=uptodate|mlocked
>> repro-403 [004] ....1 101.270517: mm_compaction_isolate_folio:
>> pfn=0x71e3e mode=0x0 flags=uptodate|mlocked
>> repro-403 [004] ....1 101.270520: mm_compaction_isolate_folio:
>> pfn=0x71e3f mode=0x0 flags=uptodate|mlocked
>>
>>
>> Unfortunately, I recently found that there is still a bug in the
>> fix patch. Setting mlocked in the mlock_folio function could happen
>> even after the page is successfully isolated, so it still cannot
>> prevent migration. Because of this, I need to think more about how
>> to fix it.
>>
>> Perhaps we should double-check whether the page is mlocked during
>> the actual migration phase.
>
> So IIUC the isolation+migration might start between the folio being
> allocated and it being mlocked? In that case the check during migration could still
Yes, in that case it would still be racy, so checking page flags is not a good idea.
> be racy, and if the page is isolated, it's already bad for the RT process.
IIUC, more accurately, it is the migration entry in the page table that is
really bad for the RT process. Isolating a page doesn't modify the page table,
so memory access continues as usual. That leads to a new idea:
S1. In the mlock[all] syscall, if mlock_vma_pages_range() hits a migration
    entry, it should wait for the migration to complete.
S2. During the unmap phase of memory migration, do not unmap a page whose
    associated vma is marked VM_LOCKED, similar to how reclaim is disabled
    for pages in a VM_LOCKED vma (try_to_unmap_one).
For a page handled during the mlock[all] syscall:
- if migration has already finished, there is nothing to do;
- if migration is in progress and the migration entry is already in place,
  we wait (S1);
- if the page is in flight, about to be isolated/migrated, S2 prevents the
  unmap.
For a page handled during a page fault: VM_LOCKED is already set on the vma,
so S2 guarantees it will not be unmapped, hence no migration entry.
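To make the combination of S1 and S2 concrete, here is a rough userspace
model of the two decisions. It is only a sketch: every toy_* name and
TOY_VM_LOCKED are made up for illustration and are not kernel definitions.
The real change would live in the mlock page-table walk and in the unmap
step of migration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's VM_LOCKED; the value is arbitrary. */
#define TOY_VM_LOCKED 0x1u

/* What the mlock walk can find in a page-table slot (simplified). */
enum toy_pte_state {
    TOY_PTE_PRESENT,          /* page is mapped normally */
    TOY_PTE_MIGRATION_ENTRY,  /* unmap phase already replaced the PTE */
};

/* What the mlock walk should do for that slot. */
enum toy_mlock_action {
    TOY_MLOCK_NOTHING_TO_DO,   /* migration finished or never started */
    TOY_MLOCK_WAIT_MIGRATION,  /* S1: wait until the migration entry is gone */
};

/* Hypothetical, minimal vma: only the flags matter for this model. */
struct toy_vma {
    unsigned int vm_flags;
};

/* S1: decision taken by the mlock[all]() walk for one PTE. */
enum toy_mlock_action toy_mlock_visit_pte(enum toy_pte_state pte)
{
    if (pte == TOY_PTE_MIGRATION_ENTRY)
        return TOY_MLOCK_WAIT_MIGRATION;
    return TOY_MLOCK_NOTHING_TO_DO;
}

/* S2: may the unmap phase of migration replace this mapping? */
bool toy_migration_may_unmap(const struct toy_vma *vma)
{
    return !(vma->vm_flags & TOY_VM_LOCKED);
}
```

With this model, a page in a VM_LOCKED vma is never unmapped for migration
(S2), and the only way the mlock walk can still see a migration entry is when
the unmap happened before VM_LOCKED was set, which S1 handles by waiting.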
Thanks a lot for the detailed feedback, Vlastimil.
Best regards,
Wandun
>
> So this would only be a short-term problem after the mlockall, but we don't
> have a way for the RT process to know the moment it's all settled, right?
Yes, some pages may already have been isolated and will still be migrated.
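For testing, one way to observe from userspace whether an mlocked page has
moved is to sample its PFN through /proc/self/pagemap before and after
compaction. A minimal sketch, assuming the documented pagemap layout (bit 63 =
present, bits 0-54 = PFN) and CAP_SYS_ADMIN, since the PFN field reads as zero
without it. The helper names and the PM_* defines are local to this example.
It does not tell the RT process when things have settled; it only detects
that a PFN changed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Local defines mirroring the documented /proc/<pid>/pagemap layout. */
#define PM_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: PFN when present */
#define PM_PRESENT  (1ULL << 63)        /* bit 63: page present in RAM */

/* True if the pagemap entry describes a page that is present in RAM. */
bool pagemap_entry_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

/* PFN stored in the entry, or 0 if the page is not present. */
uint64_t pagemap_entry_pfn(uint64_t entry)
{
    return pagemap_entry_present(entry) ? (entry & PM_PFN_MASK) : 0;
}

/* Read the pagemap entry covering vaddr; returns 0 on success, -1 on error. */
int pagemap_read_entry(const void *vaddr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0 || page_size <= 0) {
        if (fd >= 0)
            close(fd);
        return -1;
    }

    /* One 64-bit entry per virtual page, indexed by virtual page number. */
    off_t off = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
                (off_t)sizeof(*entry);
    ssize_t n = pread(fd, entry, sizeof(*entry), off);
    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

A test could call pagemap_read_entry() on a touched, mlocked address, trigger
compaction, read the entry again, and compare pagemap_entry_pfn() of the two
samples.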
> Probably the proper solution would be for mlock[all]() itself to wait for an
> isolated page, and only continue once it knows it can't be isolated anymore.
> However, this might go against some of the folio batching optimizations?
>
>> What do you think of this best-effort approach?
>>
>>
>> Best regards,
>> Wandun
>>
>>
>>
>>
>>
>> The full reproducer is as below:
>>
>> /* gcc repro.c -o repro -lpthread */
>>
>> #define _GNU_SOURCE
>> #include <fcntl.h>
>> #include <pthread.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <unistd.h>
>>
>> #define PAGE_SIZE 4096
>> #define NR_PAGES 32
>> #define FILE_SIZE (2ULL * 1024 * 1024 * 1024)
>>
>> static void *worker_fn(void *arg)
>> {
>>     int fd = (long)arg;
>>     size_t len = (size_t)FILE_SIZE;
>>     char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>     if (p == MAP_FAILED)
>>         return NULL;
>>
>>     for (size_t off = 0; off + NR_PAGES * PAGE_SIZE <= len;
>>          off += NR_PAGES * PAGE_SIZE) {
>>         for (int i = 0; i < NR_PAGES; i++)
>>             p[off + i * PAGE_SIZE] = 1;
>>         usleep(200);
>>     }
>>
>>     munmap(p, len);
>>     return NULL;
>> }
>>
>> static void *compact_fn(void *arg)
>> {
>>     (void)arg;
>>     int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
>>     if (fd < 0)
>>         return NULL;
>>
>>     while (1) {
>>         if (write(fd, "1", 1) < 0) {}
>>         usleep(5000);
>>     }
>> }
>>
>> int main(void)
>> {
>>     mlockall(MCL_CURRENT | MCL_FUTURE);
>>
>>     int fd = open("./repro_largefile.dat", O_RDWR | O_CREAT, 0600);
>>     if (fd < 0)
>>         return 1;
>>     unlink("./repro_largefile.dat");
>>     if (ftruncate(fd, (off_t)FILE_SIZE) < 0)
>>         return 1;
>>
>>     printf("repro_largefile: 1 worker, %d pages/batch, Ctrl-C to stop\n",
>>            NR_PAGES);
>>
>>     pthread_t compact, worker;
>>     pthread_create(&compact, NULL, compact_fn, NULL);
>>     pthread_create(&worker, NULL, worker_fn, (void *)(long)fd);
>>
>>     pthread_join(worker, NULL);
>>     return 0;
>> }
>>
>>>> ---
>>>> mm/compaction.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index b776f35ad020..7e07b792bcb5 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -1116,7 +1116,8 @@ isolate_migratepages_block(struct compact_control
>>>> *cc, unsigned long low_pfn,
>>>> is_unevictable = folio_test_unevictable(folio);
>>>>
>>>> /* Compaction might skip unevictable pages but CMA takes them */
>>>> - if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
>>>> + if (!(mode & ISOLATE_UNEVICTABLE) &&
>>>> + (is_unevictable || folio_test_mlocked(folio)))
>>>> goto isolate_fail_put;
>>>>
>>>> /*
>>>
>>
>