On 20 Jan 2026, at 18:02, Jordan Niethe wrote:

> Hi,
>
> On 21/1/26 09:53, Zi Yan wrote:
>> On 20 Jan 2026, at 17:33, Jordan Niethe wrote:
>>
>>> On 14/1/26 07:04, Zi Yan wrote:
>>>> On 7 Jan 2026, at 4:18, Jordan Niethe wrote:
>>>>
>>>>> Currently when creating these device private struct pages, the first
>>>>> step is to use request_free_mem_region() to get a range of physical
>>>>> address space large enough to represent the device's memory. This
>>>>> allocated physical address range is then remapped as device private
>>>>> memory using memremap_pages().
>>>>>
>>>>> Needing allocation of physical address space has some problems:
>>>>>
>>>>> 1) There may be insufficient physical address space to represent the
>>>>>    device memory. KASLR reducing the physical address space and VM
>>>>>    configurations with limited physical address space increase the
>>>>>    likelihood of hitting this, especially as device memory increases.
>>>>>    This has been observed to prevent device private from being
>>>>>    initialized.
>>>>>
>>>>> 2) Attempting to add the device private pages to the linear map at
>>>>>    addresses beyond the actual physical memory causes issues on
>>>>>    architectures like aarch64, meaning the feature does not work there.
>>>>>
>>>>> Instead of using the physical address space, introduce a device private
>>>>> address space and allocate device regions from there to represent the
>>>>> device private pages.
>>>>>
>>>>> Introduce a new interface memremap_device_private_pagemap() that
>>>>> allocates a requested amount of device private address space and
>>>>> creates the necessary device private pages.
>>>>>
>>>>> To support this new interface, struct dev_pagemap needs some changes:
>>>>>
>>>>> - Add a new dev_pagemap::nr_pages field as an input parameter.
>>>>> - Add a new dev_pagemap::pages array to store the device
>>>>>   private pages.
>>>>>
>>>>> When using memremap_device_private_pagemap(), rather than passing in
>>>>> dev_pagemap::ranges[dev_pagemap::nr_ranges] of physical address space
>>>>> to be remapped, dev_pagemap::nr_ranges will always be 1, and the device
>>>>> private range that is reserved is returned in dev_pagemap::range.
>>>>>
>>>>> Forbid calling memremap_pages() with dev_pagemap::ranges::type =
>>>>> MEMORY_DEVICE_PRIVATE.
>>>>>
>>>>> Represent this device private address space using a new
>>>>> device_private_pgmap_tree maple tree. This tree maps a given device
>>>>> private address to a struct dev_pagemap, where a specific device
>>>>> private page may then be looked up in that dev_pagemap::pages array.
>>>>>
>>>>> Device private address space can be reclaimed and the associated device
>>>>> private pages freed using the corresponding new
>>>>> memunmap_device_private_pagemap() interface.
>>>>>
>>>>> Because the device private pages now live outside the physical address
>>>>> space, they no longer have a normal PFN. This means that page_to_pfn(),
>>>>> et al. are no longer meaningful.
>>>>>
>>>>> Introduce helpers:
>>>>>
>>>>> - device_private_page_to_offset()
>>>>> - device_private_folio_to_offset()
>>>>>
>>>>> to take a given device private page / folio and return its offset
>>>>> within the device private address space.
>>>>>
>>>>> Update the places where we previously converted a device private page
>>>>> to a PFN to use these new helpers. When we encounter a device private
>>>>> offset, look up its page with device_private_offset_to_page() instead
>>>>> of looking it up within the pagemap.
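
[Editorial note: to make the flow described above concrete, here is a rough
sketch of how a driver might use the new interface. The exact prototypes of
memremap_device_private_pagemap() and memunmap_device_private_pagemap() are
not shown in this excerpt, so the int return, the NUMA node argument and the
example_* names below are assumptions; the dev_pagemap fields and the
device_private_*() helpers are the ones named in the commit message.]

#include <linux/memremap.h>

static struct dev_pagemap example_pgmap;	/* illustrative name only */

static int example_init_device_memory(unsigned long npages, int nid)
{
        int ret;

        example_pgmap.type = MEMORY_DEVICE_PRIVATE;
        example_pgmap.nr_pages = npages;	/* new input field */
        /* .ops is set up as for memremap_pages(); no physical range is requested. */

        ret = memremap_device_private_pagemap(&example_pgmap, nid);
        if (ret)
                return ret;

        /*
         * The reserved device private range comes back in example_pgmap.range
         * and the struct pages in example_pgmap.pages. A page's position in
         * the device private space is device_private_page_to_offset(page);
         * device_private_offset_to_page() is the reverse lookup.
         */
        return 0;
}

static void example_fini_device_memory(void)
{
        memunmap_device_private_pagemap(&example_pgmap);
}
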
>>>>>
>>>>> Update the existing users:
>>>>>
>>>>> - lib/test_hmm.c
>>>>> - ppc ultravisor
>>>>> - drm/amd/amdkfd
>>>>> - gpu/drm/xe
>>>>> - gpu/drm/nouveau
>>>>>
>>>>> to use the new memremap_device_private_pagemap() interface.
>>>>>
>>>>> Signed-off-by: Jordan Niethe <[email protected]>
>>>>> Signed-off-by: Alistair Popple <[email protected]>
>>>>>
>>>>> ---
>>>>>
>>>>> NOTE: The updates to the existing drivers have only been compile
>>>>> tested. I'll need some help in testing these drivers.
>>>>>
>>>>> v1:
>>>>> - Include NUMA node parameter for memremap_device_private_pagemap()
>>>>> - Add devm_memremap_device_private_pagemap() and friends
>>>>> - Update existing users of memremap_pages():
>>>>>   - ppc ultravisor
>>>>>   - drm/amd/amdkfd
>>>>>   - gpu/drm/xe
>>>>>   - gpu/drm/nouveau
>>>>> - Update for HMM huge page support
>>>>> - Guard device_private_offset_to_page and friends with CONFIG_ZONE_DEVICE
>>>>>
>>>>> v2:
>>>>> - Make sure last member of struct dev_pagemap remains
>>>>>   DECLARE_FLEX_ARRAY(struct range, ranges);
>>>>> ---
>>>>>  Documentation/mm/hmm.rst                 |  11 +-
>>>>>  arch/powerpc/kvm/book3s_hv_uvmem.c       |  41 ++---
>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  23 +--
>>>>>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  35 ++--
>>>>>  drivers/gpu/drm/xe/xe_svm.c              |  28 +---
>>>>>  include/linux/hmm.h                      |   3 +
>>>>>  include/linux/leafops.h                  |  16 +-
>>>>>  include/linux/memremap.h                 |  64 +++++++-
>>>>>  include/linux/migrate.h                  |   6 +-
>>>>>  include/linux/mm.h                       |   2 +
>>>>>  include/linux/rmap.h                     |   5 +-
>>>>>  include/linux/swapops.h                  |  10 +-
>>>>>  lib/test_hmm.c                           |  69 ++++----
>>>>>  mm/debug.c                               |   9 +-
>>>>>  mm/memremap.c                            | 193 ++++++++++++++++++-----
>>>>>  mm/mm_init.c                             |   8 +-
>>>>>  mm/page_vma_mapped.c                     |  19 ++-
>>>>>  mm/rmap.c                                |  43 +++--
>>>>>  mm/util.c                                |   5 +-
>>>>>  19 files changed, 391 insertions(+), 199 deletions(-)
>>>>>
>>>>
>>>> <snip>
>>>>
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index e65329e1969f..b36599ab41ba 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -2038,6 +2038,8 @@ static inline unsigned long memdesc_section(memdesc_flags_t mdf)
>>>>>   */
>>>>>  static inline unsigned long folio_pfn(const struct folio *folio)
>>>>>  {
>>>>> +        VM_BUG_ON(folio_is_device_private(folio));
>>>>
>>>> Please use VM_WARN_ON instead.
>>>
>>> ack.
>>>
>>>>
>>>>> +
>>>>>          return page_to_pfn(&folio->page);
>>>>>  }
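
[Editorial note: with the VM_WARN_ON change agreed above, the quoted helper
would presumably end up along these lines; only the assertion differs from
the posted hunk, and the comment is illustrative.]

static inline unsigned long folio_pfn(const struct folio *folio)
{
        /* Device private folios no longer have a meaningful PFN. */
        VM_WARN_ON(folio_is_device_private(folio));

        return page_to_pfn(&folio->page);
}
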
>>>>>
>>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>>> index 57c63b6a8f65..c1561a92864f 100644
>>>>> --- a/include/linux/rmap.h
>>>>> +++ b/include/linux/rmap.h
>>>>> @@ -951,7 +951,7 @@ static inline unsigned long page_vma_walk_pfn(unsigned long pfn)
>>>>>  static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
>>>>>  {
>>>>>          if (folio_is_device_private(folio))
>>>>> -                return page_vma_walk_pfn(folio_pfn(folio)) |
>>>>> +                return page_vma_walk_pfn(device_private_folio_to_offset(folio)) |
>>>>>                          PVMW_PFN_DEVICE_PRIVATE;
>>>>>
>>>>>          return page_vma_walk_pfn(folio_pfn(folio));
>>>>> @@ -959,6 +959,9 @@ static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
>>>>>
>>>>>  static inline struct page *page_vma_walk_pfn_to_page(unsigned long pvmw_pfn)
>>>>>  {
>>>>> +        if (pvmw_pfn & PVMW_PFN_DEVICE_PRIVATE)
>>>>> +                return device_private_offset_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
>>>>> +
>>>>>          return pfn_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
>>>>>  }
>>>>
>>>> <snip>
>>>>
>>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>>> index 96c525785d78..141fe5abd33f 100644
>>>>> --- a/mm/page_vma_mapped.c
>>>>> +++ b/mm/page_vma_mapped.c
>>>>> @@ -107,6 +107,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>>>>>  static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>>>  {
>>>>>          unsigned long pfn;
>>>>> +        bool device_private = false;
>>>>>          pte_t ptent = ptep_get(pvmw->pte);
>>>>>
>>>>>          if (pvmw->flags & PVMW_MIGRATION) {
>>>>> @@ -115,6 +116,9 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>>>                  if (!softleaf_is_migration(entry))
>>>>>                          return false;
>>>>>
>>>>> +                if (softleaf_is_migration_device_private(entry))
>>>>> +                        device_private = true;
>>>>> +
>>>>>                  pfn = softleaf_to_pfn(entry);
>>>>>          } else if (pte_present(ptent)) {
>>>>>                  pfn = pte_pfn(ptent);
>>>>> @@ -127,8 +131,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>>>                          return false;
>>>>>
>>>>>                  pfn = softleaf_to_pfn(entry);
>>>>> +
>>>>> +                if (softleaf_is_device_private(entry))
>>>>> +                        device_private = true;
>>>>>          }
>>>>>
>>>>> +        if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
>>>>> +                return false;
>>>>> +
>>>>>          if ((pfn + pte_nr - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
>>>>>                  return false;
>>>>>          if (pfn > ((pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1))
>>>>> @@ -137,8 +147,11 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>>>  }
>>>>>
>>>>>  /* Returns true if the two ranges overlap. Careful to not overflow. */
>>>>> -static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>>> +static bool check_pmd(unsigned long pfn, bool device_private, struct page_vma_mapped_walk *pvmw)
>>>>>  {
>>>>> +        if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
>>>>> +                return false;
>>>>> +
>>>>>          if ((pfn + HPAGE_PMD_NR - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
>>>>>                  return false;
>>>>>          if (pfn > (pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1)
>>>>> @@ -255,6 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>>
>>>>>                          if (!softleaf_is_migration(entry) ||
>>>>>                              !check_pmd(softleaf_to_pfn(entry),
>>>>> +                                       softleaf_is_device_private(entry) ||
>>>>> +                                       softleaf_is_migration_device_private(entry),
>>>>>                                         pvmw))
>>>>>                                  return not_found(pvmw);
>>>>>                          return true;
>>>>> @@ -262,7 +277,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>>                  if (likely(pmd_trans_huge(pmde))) {
>>>>>                          if (pvmw->flags & PVMW_MIGRATION)
>>>>>                                  return not_found(pvmw);
>>>>> -                        if (!check_pmd(pmd_pfn(pmde), pvmw))
>>>>> +                        if (!check_pmd(pmd_pfn(pmde), false, pvmw))
>>>>>                                  return not_found(pvmw);
>>>>>                          return true;
>>>>>                  }
>>>>
>>>> It seems to me that you can add a new flag like “bool is_device_private”
>>>> to indicate whether the pfn is a device private index instead of a pfn,
>>>> without manipulating pvmw->pfn itself.
>>>
>>> We could do it like that, however my concern with using a new param was
>>> that storing this info separately might make it easier to misuse a device
>>> private index as a regular pfn.
>>>
>>> It seemed like it could be easy to overlook both when creating the pvmw
>>> and then when accessing the pfn.
>>
>> That is why I asked for a helper function like page_vma_walk_pfn(pvmw) to
>> return the converted pfn instead of pvmw->pfn directly. You can add a
>> comment asking people to use the helper function and even mark pvmw->pfn
>> /* do not use directly */.
>
> Yeah, I agree that is a good idea.
>
>>
>> In addition, your patch manipulates the pfn by left shifting it by 1. Are
>> you sure there is no weird arch having pfns with bit 63 being 1? Your
>> change could break it, right?
>
> Currently for migrate pfns we left shift pfns by MIGRATE_PFN_SHIFT (6), so
> I thought doing something similar here should be safe.
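
[Editorial note: a minimal sketch of the accessor idea raised above. The
helper name and the warning are illustrative only; PVMW_PFN_SHIFT and
PVMW_PFN_DEVICE_PRIVATE are the names used in the quoted patch.]

/*
 * Callers recover a real pfn only through this helper rather than by
 * reading pvmw->pfn directly, so a device private offset cannot be
 * silently treated as a pfn.
 */
static inline unsigned long
page_vma_mapped_walk_pfn(const struct page_vma_mapped_walk *pvmw)
{
        VM_WARN_ON(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE);
        return pvmw->pfn >> PVMW_PFN_SHIFT;
}
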
Yeah, but that is limited to archs supporting HMM. page_vma_mapped_walk is
used by almost every arch, so it has a broader impact.

Best Regards,
Yan, Zi
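
[Editorial addendum: the headroom concern above could be spelled out at
build time. A sketch only; PVMW_PFN_SHIFT and PVMW_PFN_DEVICE_PRIVATE come
from the quoted patch, and MAX_PHYSMEM_BITS is assumed to be an appropriate
per-arch bound on physical address bits.]

/*
 * pvmw->pfn packs (pfn << PVMW_PFN_SHIFT) together with flag bits such as
 * PVMW_PFN_DEVICE_PRIVATE into an unsigned long, so the encoding is only
 * safe if the widest possible pfn still fits after the shift.
 */
static_assert(MAX_PHYSMEM_BITS - PAGE_SHIFT <= BITS_PER_LONG - PVMW_PFN_SHIFT,
              "pfn does not fit in pvmw->pfn once shifted to make room for flags");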
