On 24 Sep 2025, at 7:04, David Hildenbrand wrote: > On 23.09.25 05:47, Balbir Singh wrote: >> On 9/19/25 23:26, Zi Yan wrote: >>> On 19 Sep 2025, at 1:01, Balbir Singh wrote: >>> >>>> On 9/18/25 12:49, Zi Yan wrote: >>>>> On 16 Sep 2025, at 8:21, Balbir Singh wrote: >>>>> >>>>>> Add routines to support allocation of large order zone device folios >>>>>> and helper functions for zone device folios, to check if a folio is >>>>>> device private and helpers for setting zone device data. >>>>>> >>>>>> When large folios are used, the existing page_free() callback in >>>>>> pgmap is called when the folio is freed, this is true for both >>>>>> PAGE_SIZE and higher order pages. >>>>>> >>>>>> Zone device private large folios do not support deferred split and >>>>>> scan like normal THP folios. >>>>>> >>>>>> Signed-off-by: Balbir Singh <[email protected]> >>>>>> Cc: David Hildenbrand <[email protected]> >>>>>> Cc: Zi Yan <[email protected]> >>>>>> Cc: Joshua Hahn <[email protected]> >>>>>> Cc: Rakie Kim <[email protected]> >>>>>> Cc: Byungchul Park <[email protected]> >>>>>> Cc: Gregory Price <[email protected]> >>>>>> Cc: Ying Huang <[email protected]> >>>>>> Cc: Alistair Popple <[email protected]> >>>>>> Cc: Oscar Salvador <[email protected]> >>>>>> Cc: Lorenzo Stoakes <[email protected]> >>>>>> Cc: Baolin Wang <[email protected]> >>>>>> Cc: "Liam R. Howlett" <[email protected]> >>>>>> Cc: Nico Pache <[email protected]> >>>>>> Cc: Ryan Roberts <[email protected]> >>>>>> Cc: Dev Jain <[email protected]> >>>>>> Cc: Barry Song <[email protected]> >>>>>> Cc: Lyude Paul <[email protected]> >>>>>> Cc: Danilo Krummrich <[email protected]> >>>>>> Cc: David Airlie <[email protected]> >>>>>> Cc: Simona Vetter <[email protected]> >>>>>> Cc: Ralph Campbell <[email protected]> >>>>>> Cc: Mika Penttilä <[email protected]> >>>>>> Cc: Matthew Brost <[email protected]> >>>>>> Cc: Francois Dugast <[email protected]> >>>>>> --- >>>>>> include/linux/memremap.h | 10 +++++++++- >>>>>> mm/memremap.c | 34 +++++++++++++++++++++------------- >>>>>> mm/rmap.c | 6 +++++- >>>>>> 3 files changed, 35 insertions(+), 15 deletions(-) >>>>>> >>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>> index e5951ba12a28..9c20327c2be5 100644 >>>>>> --- a/include/linux/memremap.h >>>>>> +++ b/include/linux/memremap.h >>>>>> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page >>>>>> *page) >>>>>> } >>>>>> >>>>>> #ifdef CONFIG_ZONE_DEVICE >>>>>> -void zone_device_page_init(struct page *page); >>>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order); >>>>>> void *memremap_pages(struct dev_pagemap *pgmap, int nid); >>>>>> void memunmap_pages(struct dev_pagemap *pgmap); >>>>>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap >>>>>> *pgmap); >>>>>> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long >>>>>> pfn); >>>>>> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn); >>>>>> >>>>>> unsigned long memremap_compat_align(void); >>>>>> + >>>>>> +static inline void zone_device_page_init(struct page *page) >>>>>> +{ >>>>>> + struct folio *folio = page_folio(page); >>>>>> + >>>>>> + zone_device_folio_init(folio, 0); >>>>> >>>>> I assume it is for legacy code, where only non-compound page exists? >>>>> >>>>> It seems that you assume @page is always order-0, but there is no check >>>>> for it. Adding VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio) >>>>> above it would be useful to detect misuse. >>>>> >>>>>> +} >>>>>> + >>>>>> #else >>>>>> static inline void *devm_memremap_pages(struct device *dev, >>>>>> struct dev_pagemap *pgmap) >>>>>> diff --git a/mm/memremap.c b/mm/memremap.c >>>>>> index 46cb1b0b6f72..a8481ebf94cc 100644 >>>>>> --- a/mm/memremap.c >>>>>> +++ b/mm/memremap.c >>>>>> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap); >>>>>> void free_zone_device_folio(struct folio *folio) >>>>>> { >>>>>> struct dev_pagemap *pgmap = folio->pgmap; >>>>>> + unsigned long nr = folio_nr_pages(folio); >>>>>> + int i; >>>>>> >>>>>> if (WARN_ON_ONCE(!pgmap)) >>>>>> return; >>>>>> >>>>>> mem_cgroup_uncharge(folio); >>>>>> >>>>>> - /* >>>>>> - * Note: we don't expect anonymous compound pages yet. Once >>>>>> supported >>>>>> - * and we could PTE-map them similar to THP, we'd have to clear >>>>>> - * PG_anon_exclusive on all tail pages. >>>>>> - */ >>>>>> if (folio_test_anon(folio)) { >>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); >>>>>> - __ClearPageAnonExclusive(folio_page(folio, 0)); >>>>>> + for (i = 0; i < nr; i++) >>>>>> + __ClearPageAnonExclusive(folio_page(folio, i)); >>>>>> + } else { >>>>>> + VM_WARN_ON_ONCE(folio_test_large(folio)); >>>>>> } >>>>>> >>>>>> /* >>>>>> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio) >>>>>> case MEMORY_DEVICE_COHERENT: >>>>>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free)) >>>>>> break; >>>>>> - pgmap->ops->page_free(folio_page(folio, 0)); >>>>>> - put_dev_pagemap(pgmap); >>>>>> + pgmap->ops->page_free(&folio->page); >>>>>> + percpu_ref_put_many(&folio->pgmap->ref, nr); >>>>>> break; >>>>>> >>>>>> case MEMORY_DEVICE_GENERIC: >>>>>> @@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio) >>>>>> } >>>>>> } >>>>>> >>>>>> -void zone_device_page_init(struct page *page) >>>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order) >>>>>> { >>>>>> + struct page *page = folio_page(folio, 0); >>>>> >>>>> It is strange to see a folio is converted back to page in >>>>> a function called zone_device_folio_init(). >>>>> >>>>>> + >>>>>> + VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES); >>>>>> + >>>>>> /* >>>>>> * Drivers shouldn't be allocating pages after calling >>>>>> * memunmap_pages(). >>>>>> */ >>>>>> - WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref)); >>>>>> - set_page_count(page, 1); >>>>>> + WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 >>>>>> << order)); >>>>>> + folio_set_count(folio, 1); >>>>>> lock_page(page); >>>>>> + >>>>>> + if (order > 1) { >>>>>> + prep_compound_page(page, order); >>>>>> + folio_set_large_rmappable(folio); >>>>>> + } >>>>> >>>>> OK, so basically, @folio is not a compound page yet when >>>>> zone_device_folio_init() >>>>> is called. >>>>> >>>>> I feel that your zone_device_page_init() and zone_device_folio_init() >>>>> implementations are inverse. They should follow the same pattern >>>>> as __alloc_pages_noprof() and __folio_alloc_noprof(), where >>>>> zone_device_page_init() does the actual initialization and >>>>> zone_device_folio_init() just convert a page to folio. >>>>> >>>>> Something like: >>>>> >>>>> void zone_device_page_init(struct page *page, unsigned int order) >>>>> { >>>>> VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES); >>>>> >>>>> /* >>>>> * Drivers shouldn't be allocating pages after calling >>>>> * memunmap_pages(). >>>>> */ >>>>> >>>>> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << >>>>> order)); >>>>> >>>>> /* >>>>> * anonymous folio does not support order-1, high order file-backed >>>>> folio >>>>> * is not supported at all. >>>>> */ >>>>> VM_WARN_ON_ONCE(order == 1); >>>>> >>>>> if (order > 1) >>>>> prep_compound_page(page, order); >>>>> >>>>> /* page has to be compound head here */ >>>>> set_page_count(page, 1); >>>>> lock_page(page); >>>>> } >>>>> >>>>> void zone_device_folio_init(struct folio *folio, unsigned int order) >>>>> { >>>>> struct page *page = folio_page(folio, 0); >>>>> >>>>> zone_device_page_init(page, order); >>>>> page_rmappable_folio(page); >>>>> } >>>>> >>>>> Or >>>>> >>>>> struct folio *zone_device_folio_init(struct page *page, unsigned int >>>>> order) >>>>> { >>>>> zone_device_page_init(page, order); >>>>> return page_rmappable_folio(page); >>>>> } >>>>> >>>>> >>>>> Then, it comes to free_zone_device_folio() above, >>>>> I feel that pgmap->ops->page_free() should take an additional order >>>>> parameter to free a compound page like free_frozen_pages(). >>>>> >>>>> >>>>> This is my impression after reading the patch and zone device page code. >>>>> >>>>> Alistair and David can correct me if this is wrong, since I am new to >>>>> zone device page code. >>>>> >>>> >>>> Thanks, I did not want to change zone_device_page_init() for several >>>> drivers (outside my test scope) that already assume it has an order size >>>> of 0. >>> >>> But my proposed zone_device_page_init() should still work for order-0 >>> pages. You just need to change call site to add 0 as a new parameter. >>> >> >> I did not want to change existing callers (increases testing impact) >> without a strong reason. >> >>> >>> One strange thing I found in the original zone_device_page_init() is >>> the use of page_pgmap() in >>> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order)). >>> page_pgmap() calls page_folio() on the given page to access pgmap field. >>> And pgmap field is only available in struct folio. The code initializes >>> struct page, but in middle it suddenly finds the page is actually a folio, >>> then treat it as a page afterwards. I wonder if it can be done better. >>> >>> This might be a question to Alistair, since he made the change. >>> >> >> I'll let him answer it :) > > Not him, but I think this goes back to my question raised in my other reply: > When would we allocate "struct folio" in the future. > > If it's "always" then actually most of the zone-device code would only ever > operate on folios and never on pages in the future. > > I recall during a discussion at LSF/MM I raised that, and the answer was > (IIRC) that we will allocate "struct folio" as we will initialize the memmap > for dax. > > So essentially, we'd always have folios and would never really have to > operate on pages.
Hmm, then what is the point of having “struct folio”, which originally is added to save compound_head() calls, where everything is a folio in device private world? We might need DAX people to explain the rationale of “always struct folio”. Best Regards, Yan, Zi
