On 7 Jan 2026, at 21:17, Matthew Brost wrote:

> On Thu, Jan 08, 2026 at 11:56:03AM +1100, Balbir Singh wrote:
>> On 1/8/26 08:03, Zi Yan wrote:
>>> On 7 Jan 2026, at 16:15, Matthew Brost wrote:
>>>
>>>> On Wed, Jan 07, 2026 at 03:38:35PM -0500, Zi Yan wrote:
>>>>> On 7 Jan 2026, at 15:20, Zi Yan wrote:
>>>>>
>>>>>> +THP folks
>>>>>
>>>>> +willy, since he commented in another thread.
>>>>>
>>>>>>
>>>>>> On 16 Dec 2025, at 15:10, Francois Dugast wrote:
>>>>>>
>>>>>>> From: Matthew Brost <[email protected]>
>>>>>>>
>>>>>>> Introduce migrate_device_split_page() to split a device page into
>>>>>>> lower-order pages. Used when a folio allocated as higher-order is freed
>>>>>>> and later reallocated at a smaller order by the driver memory manager.
>>>>>>>
>>>>>>> Cc: Andrew Morton <[email protected]>
>>>>>>> Cc: Balbir Singh <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>> Cc: [email protected]
>>>>>>> Signed-off-by: Matthew Brost <[email protected]>
>>>>>>> Signed-off-by: Francois Dugast <[email protected]>
>>>>>>> ---
>>>>>>>  include/linux/huge_mm.h |  3 +++
>>>>>>>  include/linux/migrate.h |  1 +
>>>>>>>  mm/huge_memory.c        |  6 ++---
>>>>>>>  mm/migrate_device.c     | 49 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>  4 files changed, 56 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>> index a4d9f964dfde..6ad8f359bc0d 100644
>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>> @@ -374,6 +374,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
>>>>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>>>>>>  unsigned int min_order_for_split(struct folio *folio);
>>>>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> +		struct page *split_at, struct xa_state *xas,
>>>>>>> +		struct address_space *mapping, enum split_type split_type);
>>>>>>>  int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>>>>>>  		enum split_type split_type);
>>>>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>>>> index 26ca00c325d9..ec65e4fd5f88 100644
>>>>>>> --- a/include/linux/migrate.h
>>>>>>> +++ b/include/linux/migrate.h
>>>>>>> @@ -192,6 +192,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
>>>>>>>  			unsigned long npages);
>>>>>>>  void migrate_device_finalize(unsigned long *src_pfns,
>>>>>>>  			unsigned long *dst_pfns, unsigned long npages);
>>>>>>> +int migrate_device_split_page(struct page *page);
>>>>>>>
>>>>>>>  #endif /* CONFIG_MIGRATION */
>>>>>>>
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index 40cf59301c21..7ded35a3ecec 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -3621,9 +3621,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>>>>>   * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
>>>>>>>   * split but not to @new_order, the caller needs to check)
>>>>>>>   */
>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>> -		struct address_space *mapping, enum split_type split_type)
>>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> +		struct page *split_at, struct xa_state *xas,
>>>>>>> +		struct address_space *mapping, enum split_type split_type)
>>>>>>>  {
>>>>>>>  	const bool is_anon = folio_test_anon(folio);
>>>>>>>  	int old_order = folio_order(folio);
>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>> index 23379663b1e1..eb0f0e938947 100644
>>>>>>> --- a/mm/migrate_device.c
>>>>>>> +++ b/mm/migrate_device.c
>>>>>>> @@ -775,6 +775,49 @@ int migrate_vma_setup(struct migrate_vma *args)
>>>>>>>  EXPORT_SYMBOL(migrate_vma_setup);
>>>>>>>
>>>>>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>>>> +/**
>>>>>>> + * migrate_device_split_page() - Split device page
>>>>>>> + * @page: Device page to split
>>>>>>> + *
>>>>>>> + * Splits a device page into smaller pages. Typically called when reallocating a
>>>>>>> + * folio to a smaller size. Inherently racy - only safe if the caller ensures
>>>>>>> + * mutual exclusion within the page's folio (i.e., no other threads are using
>>>>>>> + * pages within the folio). Expected to be called on a free device page and
>>>>>>> + * restores all split-out pages to a free state.
>>>>>>> + */
>>>>>
>>>>> Do you mind explaining why __split_unmapped_folio() is needed for a free
>>>>> device page? A free page is not supposed to be a large folio, at least
>>>>> from a core MM point of view. __split_unmapped_folio() is intended to
>>>>> work on large folios (or compound pages), even if the input folio has
>>>>> refcount == 0 (because it is frozen).
>>>>>
>>>>
>>>> Well, then maybe this is a bug in core MM where the freed page is still
>>>> a THP. Let me explain the scenario and why this is needed from my POV.
>>>>
>>>> Our VRAM allocator in Xe (and several other DRM drivers) is DRM buddy.
>>>> This is a shared pool between traditional DRM GEMs (buffer objects) and
>>>> SVM allocations (pages). It doesn't have any view of the page backing - it
>>>> basically just hands back a pointer to VRAM space that we allocate from.
>>>> From that, if it's an SVM allocation, we can derive the device pages.
>>>>
>>>> What I see happening is: a 2M buddy allocation occurs, we make the
>>>> backing device pages a large folio, and sometime later the folio
>>>> refcount goes to zero and we free the buddy allocation. Later, the buddy
>>>> allocation is reused for a smaller allocation (e.g., 4K or 64K), but the
>>>> backing pages are still a large folio. Here is where we need to split
>>>
>>> I agree with you that it might be a bug in free_zone_device_folio() based
>>> on my understanding. Since zone_device_page_init() calls prep_compound_page()
>>> for >0 orders, but free_zone_device_folio() never reverses the process.
>>>
>>> Balbir and Alistair might be able to help here.
>>
>> I agree it's an API limitation

I am not sure. If free_zone_device_folio() does not get rid of large folio
metadata, there is no guarantee that a freed large device private folio will
be reallocated as a large device private folio. And when mTHP support is
added, the folio order might change too. That can cause issues when
compound_head() is called on a tail page of a previously large folio, since
compound_head() will return the old head page instead of the tail page
itself.
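To make that failure mode concrete, here is a minimal user-space sketch of the
situation described above. struct zdev_page, page_head(), prep_compound() and
the vram[] array are simplified stand-ins for struct page, compound_head() and
prep_compound_page(), not the kernel's real definitions; the only point is that
a tail page whose compound_head word is never cleared at free time keeps
redirecting lookups to the stale head after the memory is reused at a smaller
order:

#include <assert.h>
#include <stdio.h>

/* Simplified stand-in for struct page: only the field relevant here. */
struct zdev_page {
	unsigned long compound_head;	/* bit 0 set => tail page, rest is head pointer */
};

/* Mirrors the compound_head() idea: tail pages point at their head. */
static struct zdev_page *page_head(struct zdev_page *page)
{
	unsigned long head = page->compound_head;

	if (head & 1)
		return (struct zdev_page *)(head - 1);
	return page;
}

/* Model of prep_compound_page(): mark pages 1..n-1 as tails of page 0. */
static void prep_compound(struct zdev_page *pages, int n)
{
	for (int i = 1; i < n; i++)
		pages[i].compound_head = (unsigned long)&pages[0] | 1;
}

int main(void)
{
	static struct zdev_page vram[512];	/* one 2M chunk worth of 4K pages */

	/* Driver allocates the chunk as one large folio. */
	prep_compound(vram, 512);

	/*
	 * Chunk is freed, but nothing clears the tail pages' compound_head
	 * (the asymmetry: free has no counterpart to prep_compound_page()).
	 * Later the allocator reuses page 7 for a 4K allocation.
	 */
	struct zdev_page *reused = &vram[7];

	/* The head lookup still resolves to the stale 2M head, not the page itself. */
	assert(page_head(reused) != reused);
	printf("reused page still reports head %p (stale)\n", (void *)page_head(reused));
	return 0;
}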
>>
>>> I cherry picked the code from __free_frozen_pages() to reverse the process.
>>> Can you give it a try to see if it solves the above issue? Thanks.
>>>
>>> From 3aa03baa39b7e62ea079e826de6ed5aab3061e46 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <[email protected]>
>>> Date: Wed, 7 Jan 2026 16:49:52 -0500
>>> Subject: [PATCH] mm/memremap: free device private folio fix
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> ---
>>>  mm/memremap.c | 15 +++++++++++++++
>>>  1 file changed, 15 insertions(+)
>>>
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 63c6ab4fdf08..483666ff7271 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -475,6 +475,21 @@ void free_zone_device_folio(struct folio *folio)
>>>  		pgmap->ops->folio_free(folio);
>>>  		break;
>>>  	}
>>> +
>>> +	if (nr > 1) {
>>> +		struct page *head = folio_page(folio, 0);
>>> +
>>> +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
>>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>>> +		folio->_nr_pages = 0;
>>> +#endif
>>> +		for (i = 1; i < nr; i++) {
>>> +			(head + i)->mapping = NULL;
>>> +			clear_compound_head(head + i);
>>
>> I see that you're skipping the checks in free_tail_page_prepare()? IIUC, we
>> should be able to invoke it even for zone device private pages

I am not sure which parts of the compound page metadata are also used by
device private folios. Yes, it is better to add the right checks.

>>
>>> +		}
>>> +		folio->mapping = NULL;
>>
>> This is already done in free_zone_device_folio()
>>
>>> +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>
>> I don't think this is required for zone device private folios, but I suppose
>> it keeps the code generic
>>
> Well, the above code doesn't work, but I think it's the right idea.
> clear_compound_head aliases to pgmap, which we don't want to be NULL. I

Thank you for pointing it out. I am not familiar with device private page
code.

> believe the individual pages likely need their flags cleared (?), and

Yes, I missed the tail page flag clearing part.

> this step must be done before calling folio_free and include a barrier,
> as the page can be immediately reallocated.
>
> So here's what I came up with, and it seems to work (for Xe, GPU SVM):
>
> mm/memremap.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 63c6ab4fdf08..ac20abb6a441 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -448,6 +448,27 @@ void free_zone_device_folio(struct folio *folio)
>  	    pgmap->type != MEMORY_DEVICE_GENERIC)
>  		folio->mapping = NULL;
>
> +	if (nr > 1) {
> +		struct page *head = folio_page(folio, 0);
> +
> +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +		folio->_nr_pages = 0;
> +#endif
> +		for (i = 1; i < nr; i++) {
> +			struct folio *new_folio = (struct folio *)(head + i);
> +
> +			(head + i)->mapping = NULL;
> +			(head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +
> +			/* Overwrite compound_head with pgmap */
> +			new_folio->pgmap = pgmap;
> +		}
> +
> +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		smp_wmb(); /* Changes must be visible before freeing folio */
> +	}
> +
>  	switch (pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
>  	case MEMORY_DEVICE_COHERENT:
>

It looks good to me, but I am very likely missing details on device private
pages. Like Balbir pointed out above, for tail pages, calling
free_tail_page_prepare() might be better to get the same sanity checks as a
normal large folio, although you will need to set ->pgmap after it. It is
better to send it as a proper patch and get reviews from other MM folks.

>>> +	}
>>>  }
>>>
>>>  void zone_device_page_init(struct page *page, unsigned int order)
>>
>>
>> Otherwise, it seems like the right way to solve the issue.
>>
>
> My question is: why isn't Nouveau hitting this issue, or your Nvidia
> out-of-tree driver (lack of testing? Xe's test suite coverage is quite
> good at finding corner cases).
>
> Also, will this change in behavior break either of those drivers?
>
> Matt
>
>> Balbir

Best Regards,
Yan, Zi
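As background for the exchange above: the reason clear_compound_head() is off
the table here, and why the loop instead writes pgmap into each tail page,
comes down to the two fields sharing one word of struct page, as Matt notes.
Below is a minimal user-space model of that aliasing; struct zdev_page and the
opaque dev_pagemap stub are reduced stand-ins, not the kernel's real layout:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct dev_pagemap;	/* opaque, stands in for the kernel's struct dev_pagemap */

/*
 * Reduced stand-in for struct page: on ZONE_DEVICE pages the word used by
 * tail pages for compound_head is the same word that holds the pgmap pointer.
 */
struct zdev_page {
	unsigned long flags;
	union {
		unsigned long compound_head;	/* tail pages */
		struct dev_pagemap *pgmap;	/* ZONE_DEVICE pages */
	};
};

int main(void)
{
	struct dev_pagemap *pgmap = (struct dev_pagemap *)(uintptr_t)0x1000; /* dummy */
	struct zdev_page tail = { .pgmap = pgmap };

	/* What clear_compound_head() amounts to: zero the shared word... */
	tail.compound_head = 0;
	/* ...which also wipes the pgmap pointer the device page still needs. */
	assert(tail.pgmap == NULL);

	/*
	 * The alternative taken in the patch above: overwrite the word with
	 * pgmap, which both drops the tail marker (bit 0 clear) and keeps
	 * pgmap valid for the reused page.
	 */
	tail.pgmap = pgmap;
	assert(!(tail.compound_head & 1) && tail.pgmap == pgmap);

	printf("shared word now holds pgmap again: %p\n", (void *)tail.pgmap);
	return 0;
}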

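For completeness, a small user-space illustration of the ordering requirement
raised above: the tail-page metadata must be reset before the folio is
published back to the allocator, since another thread may grab and reuse it
immediately. The names (tail_metadata, page_is_free, allocator_thread) are made
up, and C11 release/acquire stands in for smp_wmb() plus whatever
synchronization the allocator itself provides:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static unsigned long tail_metadata = 0xdead;	/* stale compound_head stand-in */
static _Atomic int page_is_free = 0;

static void *allocator_thread(void *arg)
{
	(void)arg;
	/* Spin until the page is published as free (acquire pairs with release). */
	while (!atomic_load_explicit(&page_is_free, memory_order_acquire))
		;
	/* Reuse the page: the metadata reset must already be visible here. */
	assert(tail_metadata == 0);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, allocator_thread, NULL);

	tail_metadata = 0;	/* reset the metadata first... */
	/* ...then publish the page as free; release orders the reset before it. */
	atomic_store_explicit(&page_is_free, 1, memory_order_release);

	pthread_join(t, NULL);
	printf("metadata reset observed before reuse\n");
	return 0;
}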