On 7 Jan 2026, at 21:17, Matthew Brost wrote:

> On Thu, Jan 08, 2026 at 11:56:03AM +1100, Balbir Singh wrote:
>> On 1/8/26 08:03, Zi Yan wrote:
>>> On 7 Jan 2026, at 16:15, Matthew Brost wrote:
>>>
>>>> On Wed, Jan 07, 2026 at 03:38:35PM -0500, Zi Yan wrote:
>>>>> On 7 Jan 2026, at 15:20, Zi Yan wrote:
>>>>>
>>>>>> +THP folks
>>>>>
>>>>> +willy, since he commented in another thread.
>>>>>
>>>>>>
>>>>>> On 16 Dec 2025, at 15:10, Francois Dugast wrote:
>>>>>>
>>>>>>> From: Matthew Brost <[email protected]>
>>>>>>>
>>>>>>> Introduce migrate_device_split_page() to split a device page into
>>>>>>> lower-order pages. Used when a folio allocated as higher-order is freed
>>>>>>> and later reallocated at a smaller order by the driver memory manager.
>>>>>>>
>>>>>>> Cc: Andrew Morton <[email protected]>
>>>>>>> Cc: Balbir Singh <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>> Cc: [email protected]
>>>>>>> Signed-off-by: Matthew Brost <[email protected]>
>>>>>>> Signed-off-by: Francois Dugast <[email protected]>
>>>>>>> ---
>>>>>>>  include/linux/huge_mm.h |  3 +++
>>>>>>>  include/linux/migrate.h |  1 +
>>>>>>>  mm/huge_memory.c        |  6 ++---
>>>>>>>  mm/migrate_device.c     | 49 +++++++++++++++++++++++++++++++++++++++++
>>>>>>>  4 files changed, 56 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>> index a4d9f964dfde..6ad8f359bc0d 100644
>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>> @@ -374,6 +374,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
>>>>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>>>>>>  unsigned int min_order_for_split(struct folio *folio);
>>>>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> +                          struct page *split_at, struct xa_state *xas,
>>>>>>> +                          struct address_space *mapping, enum split_type split_type);
>>>>>>>  int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>>>>>>                            enum split_type split_type);
>>>>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>>>>>>> index 26ca00c325d9..ec65e4fd5f88 100644
>>>>>>> --- a/include/linux/migrate.h
>>>>>>> +++ b/include/linux/migrate.h
>>>>>>> @@ -192,6 +192,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
>>>>>>>                         unsigned long npages);
>>>>>>>  void migrate_device_finalize(unsigned long *src_pfns,
>>>>>>>                         unsigned long *dst_pfns, unsigned long npages);
>>>>>>> +int migrate_device_split_page(struct page *page);
>>>>>>>
>>>>>>>  #endif /* CONFIG_MIGRATION */
>>>>>>>
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index 40cf59301c21..7ded35a3ecec 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -3621,9 +3621,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>>>>>   * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
>>>>>>>   * split but not to @new_order, the caller needs to check)
>>>>>>>   */
>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> -               struct page *split_at, struct xa_state *xas,
>>>>>>> -               struct address_space *mapping, enum split_type split_type)
>>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> +                          struct page *split_at, struct xa_state *xas,
>>>>>>> +                          struct address_space *mapping, enum split_type split_type)
>>>>>>>  {
>>>>>>>         const bool is_anon = folio_test_anon(folio);
>>>>>>>         int old_order = folio_order(folio);
>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>> index 23379663b1e1..eb0f0e938947 100644
>>>>>>> --- a/mm/migrate_device.c
>>>>>>> +++ b/mm/migrate_device.c
>>>>>>> @@ -775,6 +775,49 @@ int migrate_vma_setup(struct migrate_vma *args)
>>>>>>>  EXPORT_SYMBOL(migrate_vma_setup);
>>>>>>>
>>>>>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>>>>>>> +/**
>>>>>>> + * migrate_device_split_page() - Split device page
>>>>>>> + * @page: Device page to split
>>>>>>> + *
>>>>>>> + * Splits a device page into smaller pages. Typically called when
>>>>>>> + * reallocating a folio to a smaller size. Inherently racy: only safe if
>>>>>>> + * the caller ensures mutual exclusion within the page's folio (i.e., no
>>>>>>> + * other threads are using pages within the folio). Expected to be called
>>>>>>> + * on a free device page; restores all split-out pages to a free state.
>>>>>>> + */
>>>>>
>>>>> Do you mind explaining why __split_unmapped_folio() is needed for a free
>>>>> device page? A free page is not supposed to be a large folio, at least
>>>>> from a core MM point of view. __split_unmapped_folio() is intended to work
>>>>> on large folios (or compound pages), even if the input folio has
>>>>> refcount == 0 (because it is frozen).
>>>>>
>>>>
>>>> Well, then maybe this is a bug in core MM where the freed page is still
>>>> a THP. Let me explain the scenario and why this is needed from my POV.
>>>>
>>>> Our VRAM allocator in Xe (and several other DRM drivers) is DRM buddy.
>>>> This is a shared pool between traditional DRM GEMs (buffer objects) and
>>>> SVM allocations (pages). It doesn’t have any view of the page backing—it
>>>> basically just hands back a pointer to VRAM space that we allocate from.
>>>> From that, if it’s an SVM allocation, we can derive the device pages.
>>>>
>>>> What I see happening is: a 2M buddy allocation occurs, we make the
>>>> backing device pages a large folio, and sometime later the folio
>>>> refcount goes to zero and we free the buddy allocation. Later, the buddy
>>>> allocation is reused for a smaller allocation (e.g., 4K or 64K), but the
>>>> backing pages are still a large folio. Here is where we need to split
>>>> the folio.
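
If I follow the scenario, the reuse path where this bites is roughly the
below (the block-to-page helper is hypothetical, not actual Xe code):

	/* Hypothetical driver-side reuse path, for illustration only */
	struct page *page = block_to_device_page(block);	/* made-up helper */

	/* the range was freed, but its pages still form a 2M folio */
	if (PageCompound(page) && new_order < folio_order(page_folio(page)))
		migrate_device_split_page(page);	/* helper added in this patch */

	zone_device_page_init(page, new_order);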
>>>
>>> I agree with you that it might be a bug in free_zone_device_folio(), based
>>> on my understanding: zone_device_page_init() calls prep_compound_page()
>>> for >0 orders, but free_zone_device_folio() never reverses the process.
>>>
>>> Balbir and Alistair might be able to help here.
>>
>> I agree it's an API limitation

I am not sure. If free_zone_device_folio() does not get rid of large folio
metadata, there is no guarantee that a freed large device private folio will
be reallocated as a large device private folio. And when mTHP support is
added, the folio order might change too. That can cause issues when
compound_head() is called on a tail page of a previously large folio, since
compound_head() will return the old head page instead of the tail page itself.
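
To make the hazard concrete, a rough illustration (not a patch) of what
happens once such a page is handed out again as an order-0 page:

	/* pfn was a tail page of the freed 2M device private folio */
	struct page *page = pfn_to_page(pfn);
	struct folio *folio = page_folio(page);	/* still resolves to the old head */

	folio_get(folio);	/* takes a reference on the wrong page */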

>>
>>>
>>> I cherry-picked the code from __free_frozen_pages() to reverse the process.
>>> Can you give it a try to see if it solves the above issue? Thanks.
>>>
>>> From 3aa03baa39b7e62ea079e826de6ed5aab3061e46 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <[email protected]>
>>> Date: Wed, 7 Jan 2026 16:49:52 -0500
>>> Subject: [PATCH] mm/memremap: free device private folio fix
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Signed-off-by: Zi Yan <[email protected]>
>>> ---
>>>  mm/memremap.c | 15 +++++++++++++++
>>>  1 file changed, 15 insertions(+)
>>>
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 63c6ab4fdf08..483666ff7271 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -475,6 +475,21 @@ void free_zone_device_folio(struct folio *folio)
>>>             pgmap->ops->folio_free(folio);
>>>             break;
>>>     }
>>> +
>>> +   if (nr > 1) {
>>> +           struct page *head = folio_page(folio, 0);
>>> +
>>> +           head[1].flags.f &= ~PAGE_FLAGS_SECOND;
>>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
>>> +           folio->_nr_pages = 0;
>>> +#endif
>>> +           for (i = 1; i < nr; i++) {
>>> +                   (head + i)->mapping = NULL;
>>> +                   clear_compound_head(head + i);
>>
>> I see that you're skipping the checks in free_tail_page_prepare()? IIUC, we
>> should be able to invoke it even for zone device private pages

I am not sure which parts of the compound page metadata are also used by
device private folios. Yes, it is better to add the right checks.

>>
>>> +           }
>>> +           folio->mapping = NULL;
>>
>> This is already done in free_zone_device_folio()
>>
>>> +           head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>
>> I don't think this is required for zone device private folios, but I suppose
>> it keeps the code generic
>>
>
> Well, the above code doesn’t work, but I think it’s the right idea.
> clear_compound_head() clears a field that aliases pgmap, which we don't
> want to be NULL. I

Thank you for pointing it out. I am not familiar with device private page code.

> believe the individual pages likely need their flags cleared (?), and

Yes, I missed the tail page flag clearing part.

> this step must be done before calling folio_free and include a barrier,
> as the page can be immediately reallocated.
>
> So here’s what I came up with, and it seems to work (for Xe, GPU SVM):
>
>  mm/memremap.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 63c6ab4fdf08..ac20abb6a441 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -448,6 +448,27 @@ void free_zone_device_folio(struct folio *folio)
>             pgmap->type != MEMORY_DEVICE_GENERIC)
>                 folio->mapping = NULL;
>
> +       if (nr > 1) {
> +               struct page *head = folio_page(folio, 0);
> +
> +               head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> +               folio->_nr_pages = 0;
> +#endif
> +               for (i = 1; i < nr; i++) {
> +                       struct folio *new_folio = (struct folio *)(head + i);
> +
> +                       (head + i)->mapping = NULL;
> +                       (head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +
> +                       /* Overwrite compound_head with pgmap */
> +                       new_folio->pgmap = pgmap;
> +               }
> +
> +               head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +               smp_wmb();    /* Changes must be visible before freeing folio */
> +       }
> +
>         switch (pgmap->type) {
>         case MEMORY_DEVICE_PRIVATE:
>         case MEMORY_DEVICE_COHERENT:
>

It looks good to me, but I am very likely missing details of device private
pages. As Balbir pointed out above, for the tail pages, calling
free_tail_page_prepare() might be better, so that we get the same sanity
checks as a normal large folio, although you will need to set ->pgmap after it.
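
For example, the tail loop could look roughly like the below, assuming
free_tail_page_prepare() is made callable from mm/memremap.c (untested
sketch, just to show the ordering):

	for (i = 1; i < nr; i++) {
		struct page *tail = head + i;
		struct folio *new_folio = (struct folio *)tail;

		/* sanity-checks the tail, then clears ->mapping and compound_head */
		if (free_tail_page_prepare(head, tail))
			pr_warn("unexpected tail page state\n");

		tail->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

		/* compound_head aliases pgmap, so restore it after the clear */
		new_folio->pgmap = pgmap;
	}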

It is better to send it as a proper patch and get reviews from other
MM folks.

>>> +   }
>>>  }
>>>
>>>  void zone_device_page_init(struct page *page, unsigned int order)
>>
>>
>> Otherwise, it seems like the right way to solve the issue.
>>
>
> My question is: why isn't Nouveau, or your Nvidia out-of-tree driver, hitting
> this issue (lack of test coverage? Xe's test suite is quite good at finding
> corner cases)?
>
> Also, will this change in behavior break either of those drivers?
>
> Matt
>
>> Balbir


Best Regards,
Yan, Zi
