On 19 Jan 2026, at 9:20, Jason Gunthorpe wrote:

> On Mon, Jan 19, 2026 at 04:59:56PM +1100, Alistair Popple wrote:
>> On 2026-01-17 at 16:27 +1100, Matthew Brost <[email protected]> wrote...
>>> On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
>>>> On 1/17/26 14:55, Matthew Brost wrote:
>>>>> On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
>>>>>> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
>>>>>>>> I suppose we could be getting say an order-9 folio that was previously
>>>>>>>> used as two order-8 folios? And each of them had their _nr_pages in
>>>>>>>> their head
>>>>>>>
>>>>>>> Yes, this is a good example. At this point we have no idea what the
>>>>>>> previous allocation(s) order(s) were - we could have multiple places in
>>>>>>> the loop where _nr_pages is populated, thus we have to clear this
>>>>>>> everywhere.
>>>>>>
>>>>>> Why? The fact you have to use such a crazy expression to even access
>>>>>> _nr_pages strongly says nothing will read it as _nr_pages.
>>>>>>
>>>>>> Explain each thing:
>>>>>>
>>>>>>          new_page->flags.f &= ~0xffUL;   /* Clear possible order, page head */
>>>>>>
>>>>>> OK, the tail page flags need to be set right, and prep_compound_page()
>>>>>> called later depends on them being zero.
>>>>>>
>>>>>>          ((struct folio *)(new_page - 1))->_nr_pages = 0;
>>>>>>
>>>>>> Can't see a reason, nothing reads _nr_pages from a random tail
>>>>>> page. _nr_pages is the last 8 bytes of struct page so it overlaps
>>>>>> memcg_data, which is also not supposed to be read from a tail page?
>>
>> This is (or was) either an order-0 page, a head page or a tail page, who
>> knows. So it doesn't really matter whether _nr_pages or memcg_data are
>> supposed to be read from a tail page. What really matters is: does any of
>> vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
>> particular value when called on this page?
>
> This weird expression is doing three things:
> 1) it is zeroing memcg on the head page
> 2) it is zeroing _nr_pages on the head folio
> 3) it is zeroing memcg on all the tail pages.
>
> Are you arguing for 1, 2 or 3?
>
> #1 is missing today
> #2 is handled directly by the prep_compound_page() -> prep_compound_head() -> folio_set_order()
> #3 I argue isn't necessary.
>
>> AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when
>> called on an order-0 page, which means it has to be cleared.
>
> Great, so let's write that in prep_compound_head()!
>
>> Although I think it would be far less confusing if it were just written like
>> that rather than the folio math, it achieves the same thing and is technically
>> correct.
>
> I have yet to hear a reason to do #3.
>
>>>>>>          new_folio->mapping = NULL;
>>>>>>
>>>>>> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
>>
>> Not pointless - vm_insert_page() for example expects folio_test_anon(), which
>> won't be the case if p->mapping was previously set to TAIL_MAPPING, so it
>> needs to be cleared. migrate_vma_setup() has a similar issue.
>
> It is pointless to put it in the loop! Sure set the head page.
>
>>>>>>          new_folio->pgmap = pgmap;       /* Also clear compound head */
>>>>>>
>>>>>> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
>>
>> No it isn't - we're not clearing tail pages here, we're initialising
>> ZONE_DEVICE struct pages ready for use by the core-mm, which means the pgmap
>> needs to be correct.
>
> See above, same issue. The tail pages have pgmap set to NULL because
> prep_compound_tail() does it. So why do we set it to pgmap here and
> then clear it a few lines below?
>
> Set it once in the head folio outside this loop.
>
>> No problem with the above, and FWIW it seems correct. Although I suspect just
>> setting page->memcg_data = 0 would have been far less controversial ;)
>
> It is "correct" but horrible.
>
> What is wrong with this? Isn't it so much better and more efficient??
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e430da900430a1..a7d3f5e4b85e49 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -806,14 +806,21 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
>               atomic_set(&folio->_pincount, 0);
>               atomic_set(&folio->_entire_mapcount, -1);
>       }
> -     if (order > 1)
> +     if (order > 1) {
>               INIT_LIST_HEAD(&folio->_deferred_list);
> +     } else {
> +             folio->mapping = NULL;
> +#ifdef CONFIG_MEMCG
> +             folio->memcg_data = 0;
> +#endif
> +     }

prep_compound_head() is only called on >0 order pages, so the else branch
above only runs for order == 1: folio->mapping and folio->memcg_data are
cleared there but left untouched for order > 1.

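If the intent was to reset those fields for every compound order, the
assignments would have to sit outside the else branch, roughly like this
(just a sketch against the hunk above, not a suggestion to actually change
prep_compound_head() - see below for the alternative):

	if (order > 1)
		INIT_LIST_HEAD(&folio->_deferred_list);
	/* mapping and memcg_data are in the head page, clear them for any order */
	folio->mapping = NULL;
#ifdef CONFIG_MEMCG
	folio->memcg_data = 0;
#endif
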
>  }
>
>  static inline void prep_compound_tail(struct page *head, int tail_idx)
>  {
>       struct page *p = head + tail_idx;
>
> +     p->flags.f &= ~0xffUL;  /* Clear possible order, page head */

No one cares about tail page flags unless they are checked by check_new_page()
in mm/page_alloc.c.

>       p->mapping = TAIL_MAPPING;
>       set_compound_head(p, head);
>       set_page_private(p, 0);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 4c2e0d68eb2798..7ec034c11068e1 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -479,19 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>       }
>  }
>
> -void zone_device_page_init(struct page *page, unsigned int order)
> +void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
> +                        unsigned int order)
>  {
>       VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
> +     struct folio *folio;
>
>       /*
>        * Drivers shouldn't be allocating pages after calling
>        * memunmap_pages().
>        */
>       WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
> -     set_page_count(page, 1);
> -     lock_page(page);
>
> -     if (order)
> -             prep_compound_page(page, order);
> +     prep_compound_page(page, order);

prep_compound_page() should only be called for >0 order pages. Calling it
unconditionally creates another weirdness in device pages by assuming all
pages are compound.

> +
> +     folio = page_folio(page);
> +     folio->pgmap = pgmap;
> +     folio_lock(folio);
> +     folio_set_count(folio, 1);

        /* clear possible previous page->mapping */
        folio->mapping = NULL;

        /* clear possible previous page->_nr_pages, which overlaps memcg_data */
#ifdef CONFIG_MEMCG
        folio->memcg_data = 0;
#endif

With the two assignments above, and still calling prep_compound_page() only
when order > 0, the code should work (see the sketch below the quoted hunk).
There is no need to change the prep_compound_*() functions.

>  }
>  EXPORT_SYMBOL_GPL(zone_device_page_init);
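
Putting the two assignments together with the rest of the quoted diff (same
new pgmap argument, same folio_* calls), zone_device_page_init() would then
look roughly like this - only a sketch, untested:

void zone_device_page_init(struct page *page, struct dev_pagemap *pgmap,
                           unsigned int order)
{
        struct folio *folio;

        VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);

        /*
         * Drivers shouldn't be allocating pages after calling
         * memunmap_pages().
         */
        WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));

        /* only >0 order pages are compound */
        if (order)
                prep_compound_page(page, order);

        folio = page_folio(page);
        folio->pgmap = pgmap;

        /* clear possible previous page->mapping */
        folio->mapping = NULL;
#ifdef CONFIG_MEMCG
        /* clear possible previous _nr_pages, which overlaps memcg_data */
        folio->memcg_data = 0;
#endif

        folio_lock(folio);
        folio_set_count(folio, 1);
}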


This patch mixes the concepts of page and folio together, thus
causing confusion. Core MM sees page and folio as two separate things:
1. page is the smallest internal physical memory management unit,
2. folio is an abstraction on top of pages; other abstractions can be
   slab, ptdesc, and more (https://kernelnewbies.org/MatthewWilcox/Memdescs).

A compound page is a high-order page whose subpages are all managed as a
whole, and it only becomes a folio after page_rmappable_folio() (see
__folio_alloc_noprof()). A slab page can be a compound page too (see how
page_slab() does a compound_head()-like operation). So a compound page is
not the same as a folio.
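
At the struct page level, what ties tail pages to their head - for folios and
slabs alike - is the compound_head field with bit 0 used as a tail tag. A
simplified version of the helpers (ignoring the hugetlb fake-head handling in
include/linux/page-flags.h) looks roughly like:

static inline void set_compound_head(struct page *page, struct page *head)
{
        /* a tail page stores the head pointer with bit 0 set as a tag */
        WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
}

static inline unsigned long _compound_head(const struct page *page)
{
        unsigned long head = READ_ONCE(page->compound_head);

        /* bit 0 set means this is a tail page; strip the tag to get the head */
        if (unlikely(head & 1))
                return head - 1;
        return (unsigned long)page;
}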

I can see that folio is used in prep_compound_head() and think it is
confusing, since these pages should not be regarded as a folio yet. I
probably blame willy (cc'd), since he started it with commit 94688e8eb453
("mm: remove folio_pincount_ptr() and head_compound_pincount()"); before
that, prep_compound_head() was all about pages. folio_set_order() was
set_compound_order() before commit 1e3be4856f49d ("mm/folio: replace
set_compound_order with folio_set_order").

If device pages have to be initialized on top of pages with obsolete state,
they should at least be initialized as pages first, then as folios, to avoid
confusion.


--
Best Regards,
Yan, Zi
