On Thu, Jan 08, 2026 at 02:14:28PM +1100, Alistair Popple wrote:
> On 2026-01-08 at 13:53 +1100, Zi Yan <[email protected]> wrote...
> > On 7 Jan 2026, at 21:17, Matthew Brost wrote:
> >
> > > On Thu, Jan 08, 2026 at 11:56:03AM +1100, Balbir Singh wrote:
> > >> On 1/8/26 08:03, Zi Yan wrote:
> > >>> On 7 Jan 2026, at 16:15, Matthew Brost wrote:
> > >>>
> > >>>> On Wed, Jan 07, 2026 at 03:38:35PM -0500, Zi Yan wrote:
> > >>>>> On 7 Jan 2026, at 15:20, Zi Yan wrote:
> > >>>>>
> > >>>>>> +THP folks
> > >>>>>
> > >>>>> +willy, since he commented in another thread.
> > >>>>>
> > >>>>>>
> > >>>>>> On 16 Dec 2025, at 15:10, Francois Dugast wrote:
> > >>>>>>
> > >>>>>>> From: Matthew Brost <[email protected]>
> > >>>>>>>
> > >>>>>>> Introduce migrate_device_split_page() to split a device page into
> > >>>>>>> lower-order pages. Used when a folio allocated as higher-order is
> > >>>>>>> freed
> > >>>>>>> and later reallocated at a smaller order by the driver memory
> > >>>>>>> manager.
> > >>>>>>>
> > >>>>>>> Cc: Andrew Morton <[email protected]>
> > >>>>>>> Cc: Balbir Singh <[email protected]>
> > >>>>>>> Cc: [email protected]
> > >>>>>>> Cc: [email protected]
> > >>>>>>> Signed-off-by: Matthew Brost <[email protected]>
> > >>>>>>> Signed-off-by: Francois Dugast <[email protected]>
> > >>>>>>> ---
> > >>>>>>> include/linux/huge_mm.h | 3 +++
> > >>>>>>> include/linux/migrate.h | 1 +
> > >>>>>>> mm/huge_memory.c | 6 ++---
> > >>>>>>> mm/migrate_device.c | 49 +++++++++++++++++++++++++++++++++++++++++
> > >>>>>>> 4 files changed, 56 insertions(+), 3 deletions(-)
> > >>>>>>>
> > >>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > >>>>>>> index a4d9f964dfde..6ad8f359bc0d 100644
> > >>>>>>> --- a/include/linux/huge_mm.h
> > >>>>>>> +++ b/include/linux/huge_mm.h
> > >>>>>>> @@ -374,6 +374,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> > >>>>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> > >>>>>>>  unsigned int min_order_for_split(struct folio *folio);
> > >>>>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
> > >>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
> > >>>>>>> +		struct page *split_at, struct xa_state *xas,
> > >>>>>>> +		struct address_space *mapping, enum split_type split_type);
> > >>>>>>>  int folio_check_splittable(struct folio *folio, unsigned int new_order,
> > >>>>>>>  		enum split_type split_type);
> > >>>>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> > >>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > >>>>>>> index 26ca00c325d9..ec65e4fd5f88 100644
> > >>>>>>> --- a/include/linux/migrate.h
> > >>>>>>> +++ b/include/linux/migrate.h
> > >>>>>>> @@ -192,6 +192,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> > >>>>>>>  			unsigned long npages);
> > >>>>>>>  void migrate_device_finalize(unsigned long *src_pfns,
> > >>>>>>>  			unsigned long *dst_pfns, unsigned long npages);
> > >>>>>>> +int migrate_device_split_page(struct page *page);
> > >>>>>>>
> > >>>>>>> #endif /* CONFIG_MIGRATION */
> > >>>>>>>
> > >>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > >>>>>>> index 40cf59301c21..7ded35a3ecec 100644
> > >>>>>>> --- a/mm/huge_memory.c
> > >>>>>>> +++ b/mm/huge_memory.c
> > >>>>>>> @@ -3621,9 +3621,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> > >>>>>>>   * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
> > >>>>>>>   * split but not to @new_order, the caller needs to check)
> > >>>>>>>   */
> > >>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
> > >>>>>>> -		struct page *split_at, struct xa_state *xas,
> > >>>>>>> -		struct address_space *mapping, enum split_type split_type)
> > >>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
> > >>>>>>> +		struct page *split_at, struct xa_state *xas,
> > >>>>>>> +		struct address_space *mapping, enum split_type split_type)
> > >>>>>>>  {
> > >>>>>>>  	const bool is_anon = folio_test_anon(folio);
> > >>>>>>>  	int old_order = folio_order(folio);
> > >>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > >>>>>>> index 23379663b1e1..eb0f0e938947 100644
> > >>>>>>> --- a/mm/migrate_device.c
> > >>>>>>> +++ b/mm/migrate_device.c
> > >>>>>>> @@ -775,6 +775,49 @@ int migrate_vma_setup(struct migrate_vma *args)
> > >>>>>>> EXPORT_SYMBOL(migrate_vma_setup);
> > >>>>>>>
> > >>>>>>> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > >>>>>>> +/**
> > >>>>>>> + * migrate_device_split_page() - Split device page
> > >>>>>>> + * @page: Device page to split
> > >>>>>>> + *
> > >>>>>>> + * Splits a device page into smaller pages. Typically called when
> > >>>>>>> + * reallocating a folio to a smaller size. This is inherently racy and only
> > >>>>>>> + * safe if the caller ensures mutual exclusion within the page's folio (i.e.,
> > >>>>>>> + * no other threads are using pages within the folio). Expected to be called
> > >>>>>>> + * on a free device page; restores all split-out pages to a free state.
> > >>>>>>> + */
> > >>>>>
> > >>>>> Do you mind explaining why __split_unmapped_folio() is needed for a
> > >>>>> free device page? A free page is not supposed to be a large folio, at
> > >>>>> least from a core MM point of view. __split_unmapped_folio() is
> > >>>>> intended to work on large folios (or compound pages), even if the
> > >>>>> input folio has refcount == 0 (because it is frozen).
> > >>>>>
> > >>>>
> > >>>> Well, then maybe this is a bug in core MM where the freed page is still
> > >>>> a THP. Let me explain the scenario and why this is needed from my POV.
> > >>>>
> > >>>> Our VRAM allocator in Xe (and several other DRM drivers) is DRM buddy.
> > >>>> This is a shared pool between traditional DRM GEMs (buffer objects) and
> > >>>> SVM allocations (pages). It doesn't have any view of the page backing;
> > >>>> it basically just hands back a pointer to VRAM space that we allocate
> > >>>> from. From that, if it's an SVM allocation, we can derive the device
> > >>>> pages.
> > >>>>
> > >>>> What I see happening is: a 2M buddy allocation occurs, we make the
> > >>>> backing device pages a large folio, and sometime later the folio
> > >>>> refcount goes to zero and we free the buddy allocation. Later, the
> > >>>> buddy allocation is reused for a smaller allocation (e.g., 4K or 64K),
> > >>>> but the backing pages are still a large folio. Here is where we need
> > >>>> to split.
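To make that lifecycle concrete, here is a rough sketch of the sequence
(illustrative only; the drm_buddy steps are described in comments, this is
not actual Xe code):

	/* 1) 2M SVM allocation: the buddy allocator hands back a 2M
	 *    block, and we back it with an order-9 device folio. */
	zone_device_page_init(page, 9);	/* calls prep_compound_page() */

	/* 2) The folio refcount drops to zero and the 2M block goes back
	 *    to the buddy allocator -- but nothing undoes the
	 *    compound-page metadata. */

	/* 3) The same block is reused for a 4K allocation. */
	zone_device_page_init(page, 0);	/* 'page' may be a stale tail
					 * page still pointing at the old
					 * order-9 head */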
> > >>>
> > >>> I agree with you that it might be a bug in free_zone_device_folio(),
> > >>> based on my understanding: zone_device_page_init() calls
> > >>> prep_compound_page() for >0 orders, but free_zone_device_folio() never
> > >>> reverses the process.
> > >>>
> > >>> Balbir and Alistair might be able to help here.
>
> Just catching up after the Christmas break.
>
I think everyone is catching up, and scrambling for the release PR. :)
> > >>
> > >> I agree it's an API limitation
> >
> > I am not sure. If free_zone_device_folio() does not get rid of large folio
> > metadata, there is no guarantee that a freed large device private folio will
> > be reallocated as a large device private folio. And when mTHP support is
> > added, the folio order might change too. That can cause issues when
> > compound_head() is called on a tail page of a previously large folio, since
> > compound_head() will return the old head page instead of the tail page
> > itself.
>
> I agree that freeing the device folio should get rid of the large folio. That
> would also keep it consistent with what we do for FS DAX for example.
>
+1
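To spell out the hazard: a minimal sketch, assuming an order-9 device
private folio was freed without its metadata being cleared and one of its
tail pages is then handed out as an order-0 page:

	/* 'head' was the head page of the freed order-9 folio */
	struct page *tail = head + 1;	/* reused for a new 4K allocation */

	/*
	 * free_zone_device_folio() never called clear_compound_head(), so
	 * page_folio()/compound_head() still chase the stale pointer to
	 * the old head instead of treating 'tail' as an order-0 page.
	 */
	struct folio *folio = page_folio(tail);	/* old head, not 'tail' */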
> > >>
> > >>>
> > >>> I cherry-picked the code from __free_frozen_pages() to reverse the
> > >>> process. Can you give it a try to see if it solves the above issue?
> > >>> Thanks.
>
> It would be nice if this could be a common helper for freeing compound
> ZONE_DEVICE pages. FS DAX already has this for example:
>
> static inline unsigned long dax_folio_put(struct folio *folio)
> {
> 	unsigned long ref;
> 	int order, i;
>
> 	if (!dax_folio_is_shared(folio))
> 		ref = 0;
> 	else
> 		ref = --folio->share;
>
> 	if (ref)
> 		return ref;
>
> 	folio->mapping = NULL;
> 	order = folio_order(folio);
> 	if (!order)
> 		return 0;
> 	folio_reset_order(folio);
>
> 	for (i = 0; i < (1UL << order); i++) {
> 		struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> 		struct page *page = folio_page(folio, i);
> 		struct folio *new_folio = (struct folio *)page;
>
> 		ClearPageHead(page);
> 		clear_compound_head(page);
>
> 		new_folio->mapping = NULL;
> 		/*
> 		 * Reset pgmap which was over-written by
> 		 * prep_compound_page().
> 		 */
> 		new_folio->pgmap = pgmap;
> 		new_folio->share = 0;
> 		WARN_ON_ONCE(folio_ref_count(new_folio));
> 	}
>
> 	return ref;
> }
>
> Aside from the weird refcount checks that FS DAX needs to do at the start of
> this function, I don't think there is anything specific to DEVICE_PRIVATE
> pages there.
>
Thanks for the reference, Alistair. This looks roughly like what I
hacked together in an effort to just get something working. I believe a
common helper can be made to work. Let me churn on this tomorrow and put
together a proper patch.
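Roughly what I have in mind, mirroring dax_folio_put() above (untested
sketch; the helper name is a placeholder and the exact fields touched may
change in the real patch):

	/* Untested sketch -- name is a placeholder, real patch to follow */
	static void zone_device_folio_reset_order(struct folio *folio)
	{
		struct dev_pagemap *pgmap = folio->pgmap;
		unsigned int order = folio_order(folio);
		long i;

		if (!order)
			return;

		folio_reset_order(folio);
		for (i = 0; i < (1L << order); i++) {
			struct page *page = folio_page(folio, i);
			struct folio *new_folio = (struct folio *)page;

			ClearPageHead(page);	/* no-op on tail pages */
			clear_compound_head(page);
			new_folio->mapping = NULL;
			/* clear_compound_head() wrote the word aliasing pgmap */
			new_folio->pgmap = pgmap;
		}
		/* Metadata must be visible before the pages are reallocated */
		smp_wmb();
	}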
> > >>>
> > >>> From 3aa03baa39b7e62ea079e826de6ed5aab3061e46 Mon Sep 17 00:00:00 2001
> > >>> From: Zi Yan <[email protected]>
> > >>> Date: Wed, 7 Jan 2026 16:49:52 -0500
> > >>> Subject: [PATCH] mm/memremap: free device private folio fix
> > >>> Content-Type: text/plain; charset="utf-8"
> > >>>
> > >>> Signed-off-by: Zi Yan <[email protected]>
> > >>> ---
> > >>> mm/memremap.c | 15 +++++++++++++++
> > >>> 1 file changed, 15 insertions(+)
> > >>>
> > >>> diff --git a/mm/memremap.c b/mm/memremap.c
> > >>> index 63c6ab4fdf08..483666ff7271 100644
> > >>> --- a/mm/memremap.c
> > >>> +++ b/mm/memremap.c
> > >>> @@ -475,6 +475,21 @@ void free_zone_device_folio(struct folio *folio)
> > >>>  		pgmap->ops->folio_free(folio);
> > >>>  		break;
> > >>>  	}
> > >>> +
> > >>> +	if (nr > 1) {
> > >>> +		struct page *head = folio_page(folio, 0);
> > >>> +
> > >>> +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> > >>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > >>> +		folio->_nr_pages = 0;
> > >>> +#endif
> > >>> +		for (i = 1; i < nr; i++) {
> > >>> +			(head + i)->mapping = NULL;
> > >>> +			clear_compound_head(head + i);
> > >>
> > >> I see that you're skipping the checks in free_tail_page_prepare()? IIUC,
> > >> we should be able to invoke it even for zone device private pages.
> >
> > I am not sure which parts of the compound page metadata are also used in
> > device private folios. Yes, it is better to add the right checks.
> >
> > >>
> > >>> +		}
> > >>> +		folio->mapping = NULL;
> > >>
> > >> This is already done in free_zone_device_folio()
> > >>
> > >>> +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >>
> > >> I don't think this is required for zone device private folios, but I
> > >> suppose it
> > >> keeps the code generic
> > >>
> > >
> > > Well, the above code doesn’t work, but I think it’s the right idea.
> > > clear_compound_head() clears the field that aliases pgmap, which we
> > > don’t want to be NULL. I
> >
> > Thank you for pointing it out. I am not familiar with device private page
> > code.
> >
> > > believe the individual pages likely need their flags cleared (?), and
> >
> > Yes, I missed the tail page flag clearing part.
> >
I think the page head is the only thing that really needs to be cleared,
though I could be wrong.
> > > this step must be done before calling folio_free and include a barrier,
> > > as the page can be immediately reallocated.
> > >
> > > So here’s what I came up with, and it seems to work (for Xe, GPU SVM):
> > >
> > > mm/memremap.c | 21 +++++++++++++++++++++
> > > 1 file changed, 21 insertions(+)
> > >
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 63c6ab4fdf08..ac20abb6a441 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -448,6 +448,27 @@ void free_zone_device_folio(struct folio *folio)
> > >  	    pgmap->type != MEMORY_DEVICE_GENERIC)
> > >  		folio->mapping = NULL;
> > >
> > > +	if (nr > 1) {
> > > +		struct page *head = folio_page(folio, 0);
> > > +
> > > +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> > > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > > +		folio->_nr_pages = 0;
> > > +#endif
> > > +		for (i = 1; i < nr; i++) {
> > > +			struct folio *new_folio = (struct folio *)(head + i);
> > > +
> > > +			(head + i)->mapping = NULL;
> > > +			(head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > +
> > > +			/* Overwrite compound_head with pgmap */
> > > +			new_folio->pgmap = pgmap;
> > > +		}
> > > +
> > > +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > +		smp_wmb(); /* Changes must be visible before freeing folio */
> > > +	}
> > > +
> > >  	switch (pgmap->type) {
> > >  	case MEMORY_DEVICE_PRIVATE:
> > >  	case MEMORY_DEVICE_COHERENT:
> > >
> >
> > It looks good to me, but I am very likely missing details on device
> > private pages. Like Balbir pointed out above, for tail pages, calling
> > free_tail_page_prepare() might be better so you get the same sanity
> > checks as a normal large folio, although you will need to set ->pgmap
> > after it.
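That makes sense. For the tail pages, something like this could work (a
sketch only; free_tail_page_prepare() is currently static in
mm/page_alloc.c, so it would need to be exposed first):

	for (i = 1; i < nr; i++) {
		struct page *page = head + i;

		/* Same tail-page sanity checks as a normal large folio;
		 * also clears ->mapping and the compound_head word. */
		free_tail_page_prepare(head, page);

		/* The compound_head word aliases pgmap, so restore it */
		((struct folio *)page)->pgmap = pgmap;
	}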
> >
> > It is better to send it as a proper patch and get reviews from other
> > MM folks.
> >
Yes, agreed. See above; I’ll work on a proper patch tomorrow and CC all
the relevant MM folks. Aiming to have something ready for the next
release PR.
Matt
> > >>> +	}
> > >>>  }
> > >>>
> > >>>  void zone_device_page_init(struct page *page, unsigned int order)
> > >>
> > >>
> > >> Otherwise, it seems like the right way to solve the issue.
> > >>
> > >
> > > My question is: why isn’t Nouveau hitting this issue, or your Nvidia
> > > out-of-tree driver? (Perhaps just a lack of testing; Xe's test suite is
> > > quite good at finding corner cases.)
> > >
> > > Also, will this change in behavior break either of those drivers?
> > >
> > > Matt
> > >
> > >> Balbir
> >
> >
> > Best Regards,
> > Yan, Zi