On 18.09.25 04:49, Zi Yan wrote:
On 16 Sep 2025, at 8:21, Balbir Singh wrote:

Add routines to support allocation of large order zone device folios,
along with helper functions for zone device folios: a check for whether
a folio is device private and helpers for setting zone device data.

When large folios are used, the existing page_free() callback in
pgmap is called when the folio is freed; this is true for both
PAGE_SIZE and higher-order pages.

Zone device private large folios do not support deferred split and
scan like normal THP folios.

Signed-off-by: Balbir Singh <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: Joshua Hahn <[email protected]>
Cc: Rakie Kim <[email protected]>
Cc: Byungchul Park <[email protected]>
Cc: Gregory Price <[email protected]>
Cc: Ying Huang <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: "Liam R. Howlett" <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Lyude Paul <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Mika Penttilä <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Francois Dugast <[email protected]>
---
  include/linux/memremap.h | 10 +++++++++-
  mm/memremap.c            | 34 +++++++++++++++++++++-------------
  mm/rmap.c                |  6 +++++-
  3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5951ba12a28..9c20327c2be5 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
  }

  #ifdef CONFIG_ZONE_DEVICE
-void zone_device_page_init(struct page *page);
+void zone_device_folio_init(struct folio *folio, unsigned int order);
  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
  void memunmap_pages(struct dev_pagemap *pgmap);
  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
@@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);

  unsigned long memremap_compat_align(void);
+
+static inline void zone_device_page_init(struct page *page)
+{
+       struct folio *folio = page_folio(page);
+
+       zone_device_folio_init(folio, 0);

I assume this is for legacy code, where only non-compound pages exist?

It seems that you assume @page is always order-0, but there is no check
for it. Adding VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio)
above it would be useful to detect misuse.
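
Something like this (untested; just to show where the suggested check
would sit in the wrapper above):

static inline void zone_device_page_init(struct page *page)
{
        struct folio *folio = page_folio(page);

        /* legacy, non-compound callers only */
        VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio);
        zone_device_folio_init(folio, 0);
}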

+}
+
  #else
  static inline void *devm_memremap_pages(struct device *dev,
                struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index 46cb1b0b6f72..a8481ebf94cc 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
  void free_zone_device_folio(struct folio *folio)
  {
        struct dev_pagemap *pgmap = folio->pgmap;
+       unsigned long nr = folio_nr_pages(folio);
+       int i;

        if (WARN_ON_ONCE(!pgmap))
                return;

        mem_cgroup_uncharge(folio);

-       /*
-        * Note: we don't expect anonymous compound pages yet. Once supported
-        * and we could PTE-map them similar to THP, we'd have to clear
-        * PG_anon_exclusive on all tail pages.
-        */
        if (folio_test_anon(folio)) {
-               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-               __ClearPageAnonExclusive(folio_page(folio, 0));
+               for (i = 0; i < nr; i++)
+                       __ClearPageAnonExclusive(folio_page(folio, i));
+       } else {
+               VM_WARN_ON_ONCE(folio_test_large(folio));
        }

        /*
@@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
        case MEMORY_DEVICE_COHERENT:
                if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
                        break;
-               pgmap->ops->page_free(folio_page(folio, 0));
-               put_dev_pagemap(pgmap);
+               pgmap->ops->page_free(&folio->page);
+               percpu_ref_put_many(&folio->pgmap->ref, nr);
                break;

        case MEMORY_DEVICE_GENERIC:
@@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio)
        }
  }

-void zone_device_page_init(struct page *page)
+void zone_device_folio_init(struct folio *folio, unsigned int order)
  {
+       struct page *page = folio_page(folio, 0);

It is strange to see a folio converted back to a page in a
function called zone_device_folio_init().

+
+       VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
+
        /*
         * Drivers shouldn't be allocating pages after calling
         * memunmap_pages().
         */
-       WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
-       set_page_count(page, 1);
+       WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
+       folio_set_count(folio, 1);
        lock_page(page);
+
+       if (order > 1) {
+               prep_compound_page(page, order);
+               folio_set_large_rmappable(folio);
+       }

OK, so basically, @folio is not a compound page yet when
zone_device_folio_init() is called.

I feel that your zone_device_page_init() and zone_device_folio_init()
implementations are inverted. They should follow the same pattern
as __alloc_pages_noprof() and __folio_alloc_noprof(), where
zone_device_page_init() does the actual initialization and
zone_device_folio_init() just converts a page to a folio.

Something like:

void zone_device_page_init(struct page *page, unsigned int order)
{
        VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);

        /*
         * Drivers shouldn't be allocating pages after calling
         * memunmap_pages().
         */
        WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));

        /*
         * anonymous folio does not support order-1, high order file-backed
         * folio is not supported at all.
         */
        VM_WARN_ON_ONCE(order == 1);

        if (order > 1)
                prep_compound_page(page, order);

        /* page has to be compound head here */
        set_page_count(page, 1);
        lock_page(page);
}

void zone_device_folio_init(struct folio *folio, unsigned int order)
{
        struct page *page = folio_page(folio, 0);

        zone_device_page_init(page, order);
        page_rmappable_folio(page);
}

Or

struct folio *zone_device_folio_init(struct page *page, unsigned int order)
{
        zone_device_page_init(page, order);
        return page_rmappable_folio(page);
}

I think the problem is that it will all be weird once we dynamically allocate "struct folio".

I don't yet have a clear understanding of how that would really work.

For example, should it be pgmap->ops->page_folio()?

Who allocates the folio? Do we allocate all order-0 folios initially and then merge them when constructing large folios? How do we manage the "struct folio" during such merging/splitting?

With that in mind, I don't really know what the proper interface should be today.


zone_device_folio_init(struct page *page, unsigned int order)

looks cleaner, agreed.
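
FWIW, driver-side usage would then look something like this (a rough
sketch, assuming the folio-returning variant of your suggestion;
device_pfn and order are just placeholders for whatever the driver
allocated):

        struct page *page = pfn_to_page(device_pfn);
        struct folio *folio;

        folio = zone_device_folio_init(page, order);
        /* the folio is locked with a refcount of 1 at this point */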



Then, when it comes to free_zone_device_folio() above,
I feel that pgmap->ops->page_free() should take an additional order
parameter to free a compound page, like free_frozen_pages().


IIRC free_frozen_pages() does not operate on compound pages. If we know that we are operating on a compound page (or single page) then passing in the page (or better the folio) should work.
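
E.g., a driver-side sketch (purely illustrative: my_drv_free_range() is
made up, and this assumes the compound metadata is still intact when
->page_free() runs):

static void my_drv_page_free(struct page *page)
{
        struct folio *folio = page_folio(page);
        unsigned long nr = folio_nr_pages(folio);

        /* the head page is enough to recover the size of the range */
        my_drv_free_range(page_to_pfn(page), nr);
}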

--
Cheers

David / dhildenb
