On 21/1/26 13:41, Zi Yan wrote:
On 20 Jan 2026, at 18:34, Jordan Niethe wrote:

Hi,

On 21/1/26 10:06, Zi Yan wrote:
On 20 Jan 2026, at 18:02, Jordan Niethe wrote:

Hi,

On 21/1/26 09:53, Zi Yan wrote:
On 20 Jan 2026, at 17:33, Jordan Niethe wrote:

On 14/1/26 07:04, Zi Yan wrote:
On 7 Jan 2026, at 4:18, Jordan Niethe wrote:

Currently, when creating these device private struct pages, the first
step is to use request_free_mem_region() to get a range of physical
address space large enough to represent the device's memory. This
allocated physical address range is then remapped as device private
memory using memremap_pages().

Needing allocation of physical address space has some problems:

      1) There may be insufficient physical address space to represent the
         device memory. KASLR reducing the physical address space and VM
         configurations with limited physical address space increase the
         likelihood of hitting this, especially as device memory sizes grow.
         This has been observed to prevent device private memory from being
         initialized.

      2) Attempting to add the device private pages to the linear map at
         addresses beyond the actual physical memory causes issues on
         architectures like aarch64, meaning the feature does not work there.

Instead of using the physical address space, introduce a device private
address space and allocate device regions from there to represent the
device private pages.

Introduce a new interface memremap_device_private_pagemap() that
allocates a requested amount of device private address space and creates
the necessary device private pages.

To support this new interface, struct dev_pagemap needs some changes:

      - Add a new dev_pagemap::nr_pages field as an input parameter.
      - Add a new dev_pagemap::pages array to store the device
        private pages.
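
Roughly, that is (the types and placement shown here are illustrative and
the existing fields are elided; see the include/linux/memremap.h changes in
the full patch for the real definition):

        struct dev_pagemap {
                ...
                unsigned long nr_pages;         /* in: number of device private pages */
                struct page *pages;             /* out: the device private pages */
                ...
                /* must remain the last member (see the v2 note below) */
                DECLARE_FLEX_ARRAY(struct range, ranges);
        };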

When using memremap_device_private_pagemap(), rather than passing in
dev_pagemap::ranges[dev_pagemap::nr_ranges] of physical address space to
be remapped, dev_pagemap::nr_ranges will always be 1, and the device
private range that is reserved is returned in dev_pagemap::range.

Forbid calling memremap_pages() with dev_pagemap::type =
MEMORY_DEVICE_PRIVATE.

Represent this device private address space using a new
device_private_pgmap_tree maple tree. This tree maps a given device
private address to a struct dev_pagemap, where a specific device private
page may then be looked up in that dev_pagemap::pages array.

Device private address space can be reclaimed and the associated device
private pages freed using the corresponding new
memunmap_device_private_pagemap() interface.
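
To make the intended flow concrete, here is a rough usage sketch from a
driver's point of view (the local names, return convention and exact
prototypes are illustrative only; the real interface is in the
include/linux/memremap.h changes in the full patch):

        struct dev_pagemap *pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);

        pgmap->type = MEMORY_DEVICE_PRIVATE;
        pgmap->nr_pages = dev_mem_size >> PAGE_SHIFT;   /* new input field */
        pgmap->ops = &my_pagemap_ops;                   /* driver supplied */

        /*
         * Reserves device private address space (not physical address
         * space) and creates the struct pages. The reserved range comes
         * back in pgmap->range and pgmap->nr_ranges is always 1.
         */
        ret = memremap_device_private_pagemap(pgmap, nid);

        ...

        /* Release the pages and the device private address space. */
        memunmap_device_private_pagemap(pgmap);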

Because the device private pages now live outside the physical address
space, they no longer have a normal PFN. This means that page_to_pfn(),
et al. are no longer meaningful.

Introduce helpers:

      - device_private_page_to_offset()
      - device_private_folio_to_offset()

to take a given device private page / folio and return its offset within
the device private address space.

Update the places where we previously converted a device private page to
a PFN to use these new helpers. Where we encounter a device private
offset, use device_private_offset_to_page() to look up its page rather
than looking it up within the pagemap.
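
In other words, the conversions change roughly like this (illustrative
only; the real call sites are in the diffs below):

        /* before: device private pages had a PFN in the reserved physical range */
        pfn  = page_to_pfn(page);
        page = pfn_to_page(pfn);

        /* after: device private pages are addressed by offset */
        offset = device_private_page_to_offset(page);
        page   = device_private_offset_to_page(offset);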

Update the existing users:

     - lib/test_hmm.c
     - ppc ultravisor
     - drm/amd/amdkfd
     - gpu/drm/xe
     - gpu/drm/nouveau

to use the new memremap_device_private_pagemap() interface.

Signed-off-by: Jordan Niethe <[email protected]>
Signed-off-by: Alistair Popple <[email protected]>

---

NOTE: The updates to the existing drivers have only been compile tested.
I'll need some help in testing these drivers.

v1:
- Include NUMA node parameter for memremap_device_private_pagemap()
- Add devm_memremap_device_private_pagemap() and friends
- Update existing users of memremap_pages():
        - ppc ultravisor
        - drm/amd/amdkfd
        - gpu/drm/xe
        - gpu/drm/nouveau
- Update for HMM huge page support
- Guard device_private_offset_to_page and friends with CONFIG_ZONE_DEVICE

v2:
- Make sure last member of struct dev_pagemap remains DECLARE_FLEX_ARRAY(struct range, ranges);
---
     Documentation/mm/hmm.rst                 |  11 +-
     arch/powerpc/kvm/book3s_hv_uvmem.c       |  41 ++---
     drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  23 +--
     drivers/gpu/drm/nouveau/nouveau_dmem.c   |  35 ++--
     drivers/gpu/drm/xe/xe_svm.c              |  28 +---
     include/linux/hmm.h                      |   3 +
     include/linux/leafops.h                  |  16 +-
     include/linux/memremap.h                 |  64 +++++++-
     include/linux/migrate.h                  |   6 +-
     include/linux/mm.h                       |   2 +
     include/linux/rmap.h                     |   5 +-
     include/linux/swapops.h                  |  10 +-
     lib/test_hmm.c                           |  69 ++++----
     mm/debug.c                               |   9 +-
     mm/memremap.c                            | 193 ++++++++++++++++++-----
     mm/mm_init.c                             |   8 +-
     mm/page_vma_mapped.c                     |  19 ++-
     mm/rmap.c                                |  43 +++--
     mm/util.c                                |   5 +-
     19 files changed, 391 insertions(+), 199 deletions(-)

<snip>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e65329e1969f..b36599ab41ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2038,6 +2038,8 @@ static inline unsigned long memdesc_section(memdesc_flags_t mdf)
      */
     static inline unsigned long folio_pfn(const struct folio *folio)
     {
+       VM_BUG_ON(folio_is_device_private(folio));

Please use VM_WARN_ON instead.

ack.


+
        return page_to_pfn(&folio->page);
     }

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 57c63b6a8f65..c1561a92864f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -951,7 +951,7 @@ static inline unsigned long page_vma_walk_pfn(unsigned long pfn)
     static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
     {
        if (folio_is_device_private(folio))
-               return page_vma_walk_pfn(folio_pfn(folio)) |
+               return page_vma_walk_pfn(device_private_folio_to_offset(folio)) |
                      PVMW_PFN_DEVICE_PRIVATE;

        return page_vma_walk_pfn(folio_pfn(folio));
@@ -959,6 +959,9 @@ static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)

     static inline struct page *page_vma_walk_pfn_to_page(unsigned long pvmw_pfn)
     {
+       if (pvmw_pfn & PVMW_PFN_DEVICE_PRIVATE)
+               return device_private_offset_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
+
        return pfn_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
     }

<snip>

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 96c525785d78..141fe5abd33f 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -107,6 +107,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
     static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
     {
        unsigned long pfn;
+       bool device_private = false;
        pte_t ptent = ptep_get(pvmw->pte);

        if (pvmw->flags & PVMW_MIGRATION) {
@@ -115,6 +116,9 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
                if (!softleaf_is_migration(entry))
                        return false;

+               if (softleaf_is_migration_device_private(entry))
+                       device_private = true;
+
                pfn = softleaf_to_pfn(entry);
        } else if (pte_present(ptent)) {
                pfn = pte_pfn(ptent);
@@ -127,8 +131,14 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
                        return false;

                pfn = softleaf_to_pfn(entry);
+
+               if (softleaf_is_device_private(entry))
+                       device_private = true;
        }

+       if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
+               return false;
+
        if ((pfn + pte_nr - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
                return false;
        if (pfn > ((pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1))
@@ -137,8 +147,11 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
     }

     /* Returns true if the two ranges overlap.  Careful to not overflow. */
-static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
+static bool check_pmd(unsigned long pfn, bool device_private, struct page_vma_mapped_walk *pvmw)
     {
+       if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
+               return false;
+
        if ((pfn + HPAGE_PMD_NR - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
                return false;
        if (pfn > (pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1)
@@ -255,6 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)

                                if (!softleaf_is_migration(entry) ||
                                    !check_pmd(softleaf_to_pfn(entry),
+                                              softleaf_is_device_private(entry) ||
+                                              softleaf_is_migration_device_private(entry),
                                               pvmw))
                                        return not_found(pvmw);
                                return true;
@@ -262,7 +277,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
                        if (likely(pmd_trans_huge(pmde))) {
                                if (pvmw->flags & PVMW_MIGRATION)
                                        return not_found(pvmw);
-                               if (!check_pmd(pmd_pfn(pmde), pvmw))
+                               if (!check_pmd(pmd_pfn(pmde), false, pvmw))
                                        return not_found(pvmw);
                                return true;
                        }

It seems to me that you can add a new flag like “bool is_device_private” to
indicate whether the pfn is a device private index instead of a pfn, without
manipulating pvmw->pfn itself.

We could do it like that; however, my concern with using a new parameter was
that storing this info separately might make it easier to misuse a device
private index as a regular pfn.

It seemed like it could be easy to overlook, both when creating the pvmw and
then when accessing the pfn.

That is why I asked for a helper function like page_vma_walk_pfn(pvmw) to
return the converted pfn instead of pvmw->pfn directly. You can add a comment
to ask people to use the helper function and even mark pvmw->pfn /* do not use
directly */.

Yeah I agree that is a good idea.


In addition, your patch manipulates pfn by left shifting it by 1. Are you sure
there is no weird arch having pfns with bit 63 being 1? Your change could
break it, right?

Currently for migrate pfns we left shift pfns by MIGRATE_PFN_SHIFT (6), so I
thought doing something similar here should be safe.

Yeah, but that is limited to archs supporting HMM. page_vma_mapped_walk is used
by almost every arch, so it has a broader impact.

That is a good point.

I see a few options:

- Can we assume SWP_PFN_BITS on every arch? I could add a sanity check that we
   have an extra bit on top of SWP_PFN_BITS within an unsigned long (rough
   sketch after these options).

Yes, but if there is no extra bit, are you going to disable device private
pages?

In this case, migrate PFNs would also be broken (due to MIGRATE_PFN_SHIFT), so
we'd have to.


- We could define PVMW_PFN_SHIFT as 0 if !CONFIG_MIGRATION, as the flag is not
   required.

Sure, or !CONFIG_DEVICE_MIGRATION

- Instead of modifying pvmw->pfn we could use pvmw->flags, but that has the
   issue of separating the offset type from the offset.
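
For the sanity check mentioned in the first option above, something along
these lines is what I had in mind (untested sketch):

        /* Room for the shifted PFN plus the flag bit in an unsigned long. */
        BUILD_BUG_ON(SWP_PFN_BITS + PVMW_PFN_SHIFT > BITS_PER_LONG);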

It seems that I was not clear on my proposal. Here is the patch on top of
your patchset, and it compiles.

Oh I'd interpreted “bool is_device_private” as adding a new field to pvmw.


Basically, pvmw->pfn stores either a PFN or a device private offset without any
additional shift. The caller interprets pvmw->pfn based on
pvmw->flags & PVMW_DEVICE_PRIVATE. And you can ignore my earlier suggestion of
a helper function for pvmw->pfn, since my patch below can use pvmw->pfn
directly.

Thanks, looks reasonable. I'll try it.

Thanks,
Jordan.


Let me know if my patch works. Thanks.

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c1561a92864f..4423f0e886aa 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -921,6 +921,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
  #define PVMW_SYNC             (1 << 0)
  /* Look for migration entries rather than present PTEs */
  #define PVMW_MIGRATION                (1 << 1)
+#define PVMW_DEVICE_PRIVATE    (1 << 2)

  /* Result flags */

@@ -943,6 +944,13 @@ struct page_vma_mapped_walk {
  #define PVMW_PFN_DEVICE_PRIVATE       (1UL << 0)
  #define PVMW_PFN_SHIFT                1

+static inline unsigned long page_vma_walk_flags(struct folio *folio, unsigned long flags)
+{
+       if (folio_is_device_private(folio))
+               return flags | PVMW_DEVICE_PRIVATE;
+       return flags;
+}
+
  static inline unsigned long page_vma_walk_pfn(unsigned long pfn)
  {
        return (pfn << PVMW_PFN_SHIFT);
@@ -951,23 +959,16 @@ static inline unsigned long page_vma_walk_pfn(unsigned long pfn)
  static inline unsigned long folio_page_vma_walk_pfn(const struct folio *folio)
  {
        if (folio_is_device_private(folio))
-               return page_vma_walk_pfn(device_private_folio_to_offset(folio)) |
-                      PVMW_PFN_DEVICE_PRIVATE;
-
-       return page_vma_walk_pfn(folio_pfn(folio));
-}
-
-static inline struct page *page_vma_walk_pfn_to_page(unsigned long pvmw_pfn)
-{
-       if (pvmw_pfn & PVMW_PFN_DEVICE_PRIVATE)
-               return device_private_offset_to_page(pvmw_pfn >> 
PVMW_PFN_SHIFT);
+               return device_private_folio_to_offset(folio);

-       return pfn_to_page(pvmw_pfn >> PVMW_PFN_SHIFT);
+       return (folio_pfn(folio));
  }

-static inline struct folio *page_vma_walk_pfn_to_folio(unsigned long pvmw_pfn)
+static inline struct folio *page_vma_walk_pfn_to_folio(struct page_vma_mapped_walk *pvmw)
  {
-       return page_folio(page_vma_walk_pfn_to_page(pvmw_pfn));
+       if (pvmw->flags & PVMW_DEVICE_PRIVATE)
+               return page_folio(device_private_offset_to_page(pvmw->pfn));
+       return pfn_folio(pvmw->pfn);
  }

  #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags)   \
@@ -977,7 +978,7 @@ static inline struct folio *page_vma_walk_pfn_to_folio(unsigned long pvmw_pfn)
                .pgoff = folio_pgoff(_folio),                           \
                .vma = _vma,                                            \
                .address = _address,                                    \
-               .flags = _flags,                                        \
+               .flags = page_vma_walk_flags(_folio, _flags),           \
        }

   static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 141fe5abd33f..e61a0e49a7c9 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -136,12 +136,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
                        device_private = true;
        }

-       if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
+       if ((device_private) ^ !!(pvmw->flags & PVMW_DEVICE_PRIVATE))
                return false;

-       if ((pfn + pte_nr - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
+       if ((pfn + pte_nr - 1) < pvmw->pfn)
                return false;
-       if (pfn > ((pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1))
+       if (pfn > (pvmw->pfn + pvmw->nr_pages - 1))
                return false;
        return true;
  }
@@ -149,12 +149,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
  /* Returns true if the two ranges overlap.  Careful to not overflow. */
   static bool check_pmd(unsigned long pfn, bool device_private, struct page_vma_mapped_walk *pvmw)
  {
-       if ((device_private) ^ !!(pvmw->pfn & PVMW_PFN_DEVICE_PRIVATE))
+       if ((device_private) ^ !!(pvmw->flags & PVMW_DEVICE_PRIVATE))
                return false;

-       if ((pfn + HPAGE_PMD_NR - 1) < (pvmw->pfn >> PVMW_PFN_SHIFT))
+       if ((pfn + HPAGE_PMD_NR - 1) < pvmw->pfn)
                return false;
-       if (pfn > (pvmw->pfn >> PVMW_PFN_SHIFT) + pvmw->nr_pages - 1)
+       if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
                return false;
        return true;
  }
@@ -369,7 +369,7 @@ unsigned long page_mapped_in_vma(const struct page *page,
                .pfn = folio_page_vma_walk_pfn(folio),
                .nr_pages = 1,
                .vma = vma,
-               .flags = PVMW_SYNC,
+               .flags = page_vma_walk_flags(folio, PVMW_SYNC),
        };

        pvmw.address = vma_address(vma, page_pgoff(folio, page), 1);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index be5682d345b5..5d81939bf12a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4203,7 +4203,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
        pte_t *pte = pvmw->pte;
        unsigned long addr = pvmw->address;
        struct vm_area_struct *vma = pvmw->vma;
-       struct folio *folio = page_vma_walk_pfn_to_folio(pvmw->pfn);
+       struct folio *folio = page_vma_walk_pfn_to_folio(pvmw);
        struct mem_cgroup *memcg = folio_memcg(folio);
        struct pglist_data *pgdat = folio_pgdat(folio);
        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);





Best Regards,
Yan, Zi

