On 2026-02-25 at 02:17 +1100, Gregory Price <[email protected]> wrote...
> On Tue, Feb 24, 2026 at 05:19:11PM +1100, Alistair Popple wrote:
> > On 2026-02-22 at 19:48 +1100, Gregory Price <[email protected]> wrote...
> > 
> > Based on our discussion at LPC I believe one of the primary motivators here
> > was to re-use the existing mm buddy allocator rather than writing your own.
> > I remain to be convinced that alone is justification enough for doing all
> > this - DRM for example already has quite a nice standalone buddy allocator
> > (drm_buddy.c) that could presumably be used, or adapted for use, by any
> > device driver.
> >
> > The interesting part of this series (which I have skimmed but not read in
> > detail) is how device memory gets exposed to userspace - this is something
> > that existing ZONE_DEVICE implementations don't address, instead leaving it
> > up to drivers and associated userspace stacks to deal with allocation,
> > migration, etc.
> > 
> 
> I agree that buddy-access alone is insufficient justification, it
> started off that way - but if you want mempolicy/NUMA UAPI access,
> it turns into "Re-use all of MM" - and that means using the buddy.
> 
> I also expected ZONE_DEVICE vs NODE_DATA to be the primary discussion,
> 
> I raise replacing it as a thought experiment, but not the proposal.
> 
> The idea that drm/ is going to switch to private nodes is outside the
> realm of reality, but part of that is because of years of infrastructure
> built on the assumption that re-using mm/ is infeasible.
> 
> But, let's talk about DEVICE_COHERENT
> 
> ---
> 
> DEVICE_COHERENT is the odd-man out among ZONE_DEVICE modes. The others
> use softleaf entries and don't allow direct mappings.

I think you have this around the wrong way - DEVICE_PRIVATE is the odd one out
as it is the one ZONE_DEVICE page type that uses softleaf entries and doesn't
allow direct mappings. Every other type of ZONE_DEVICE page allows for direct
mappings.

> (DEVICE_PRIVATE sort of does if you squint, but you can also view that
>  a bit like PROT_NONE or read-only controls to force migrations).
> 
> If you take DEVICE_COHERENT and:
> 
> - Move pgmap out of the struct page (page_ext, NODE_DATA, etc) to free
>   the LRU list_head
> - Put pages in the buddy (free lists, watermarks, managed_pages) or add
>   pgmap->device_alloc() at every allocation callsite / buddy hook
> - Add LRU support (aging, reclaim, compaction)
> - Add isolated gating (new GFP flag and adjusted zonelist filtering)
> - Add new dev_pagemap_ops callbacks for the various mm/ features
> - Audit every folio_is_zone_device() to distinguish zone device modes
> 
> ... you've built N_MEMORY_PRIVATE inside ZONE_DEVICE. Except now
> page_zone(page) returns ZONE_DEVICE - so you inherit the wrong
> defaults at every existing ZONE_DEVICE check. 
> 
> Skip-sites become things to opt-out of instead of opting into.
> 
> You just end up with
> 
> if (folio_is_zone_device(folio))
>     if (folio_is_my_special_zone_device())
>     else ....
> 
> and this just generalizes to
> 
> if (folio_is_private_managed(folio))
>     folio_managed_my_hooked_operation()

I don't quite get this - couldn't you just as easily do:

if (folio_is_zone_device(folio))
     folio_device_my_hooked_operation()

Where folio_device_my_hooked_operation() is just:

if (pgmap->ops->my_hooked_operation)
        pgmap->ops->my_hooked_operation();
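
To make the shape of that concrete, here is a minimal userspace sketch of the
dispatch pattern, with stub types standing in for struct folio and struct
dev_pagemap - my_pgmap_ops and my_hooked_operation are illustrative names, not
the kernel's real API:

```c
#include <stdbool.h>
#include <stddef.h>

struct my_folio;

/* stand-in for dev_pagemap_ops: every hook is optional */
struct my_pgmap_ops {
	bool (*my_hooked_operation)(struct my_folio *folio);
};

/* stand-in for struct dev_pagemap */
struct my_pgmap {
	const struct my_pgmap_ops *ops;
};

/* stand-in for struct folio, with the page->pgmap back-pointer */
struct my_folio {
	struct my_pgmap *pgmap;
	bool is_zone_device;
};

/*
 * Dispatch only when the driver installed a hook; otherwise fall
 * back to the generic (non-device) handling.
 */
static bool folio_device_my_hooked_operation(struct my_folio *folio)
{
	if (folio->is_zone_device && folio->pgmap &&
	    folio->pgmap->ops && folio->pgmap->ops->my_hooked_operation)
		return folio->pgmap->ops->my_hooked_operation(folio);
	return false;	/* generic handling */
}
```

The point being the caller only needs the one folio_is_zone_device() test; the
per-driver behaviour hangs off the ops table rather than off new zone checks.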

> So you get the same code, but have added more complexity to ZONE_DEVICE.

Don't you still have to add code to hook every operation you care about for your
private managed nodes?

> I don't think that's needed if we just recognize ZONE is the wrong
> abstraction to be operating on.
> 
> Honestly, even ZONE_MOVABLE becomes pointless with N_MEMORY_PRIVATE
> if you disallow longterm pinning - because the managing service handles
> allocations (it has to inject GFP_PRIVATE to get access) or selectively
> enables the mm/ services it knows are safe (mempolicy).
> 
> Even if you allow longterm pinning, if your service controls what does
> the pinning it can still be reclaimable - just manually (killing
> processes) instead of letting hotplug do it via migration.
> 
> If your service only allocates movable pages - your ZONE_NORMAL is
> effectively ZONE_MOVABLE.  

This is interesting - it sounds like the conclusion of this is that ZONE_* is
just a bad abstraction and should be replaced with something else, maybe
something like this?

And FWIW I'm not tied to ZONE_DEVICE being a good abstraction, it's just what
we seem to have today for determining page types. It almost sounds like what
we want is just a bunch of hooks that can be associated with a range of pages,
and then you just get rid of ZONE_DEVICE and instead install hooks appropriate
for each page a driver manages. I have to think more about that though, this
is just what popped into my head when you started saying ZONE_MOVABLE could
also disappear :-)
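
Very roughly, and purely as a thought experiment (none of these names exist in
the kernel), I imagine it looking something like this: a driver registers an
ops table against a PFN range, and core code looks the ops up by PFN instead
of testing the page's zone:

```c
#include <stddef.h>

/* whatever per-page services a driver wants to override */
struct range_ops {
	int (*handle_fault)(unsigned long pfn);
};

struct pfn_range {
	unsigned long start, end;	/* [start, end) */
	const struct range_ops *ops;
};

#define MAX_RANGES 8
static struct pfn_range ranges[MAX_RANGES];
static int nr_ranges;

/* Called by a driver at hotplug/probe time for the memory it manages */
static int register_pfn_range(unsigned long start, unsigned long end,
			      const struct range_ops *ops)
{
	if (nr_ranges >= MAX_RANGES || start >= end)
		return -1;
	ranges[nr_ranges++] = (struct pfn_range){ start, end, ops };
	return 0;
}

/* Replaces the zone test: NULL means ordinary buddy-managed memory */
static const struct range_ops *pfn_range_ops(unsigned long pfn)
{
	for (int i = 0; i < nr_ranges; i++) {
		if (pfn >= ranges[i].start && pfn < ranges[i].end)
			return ranges[i].ops;
	}
	return NULL;
}
```

Core mm paths would then consult pfn_range_ops() where they currently test for
ZONE_DEVICE, and fall through to the generic path on NULL. As I said though, I
need to think about this more.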

> In some cases we use ZONE_MOVABLE to prevent the kernel from allocating
> memory onto devices (like CXL).  This means struct page is forced to
> take up DRAM or use memmap_on_memory - meaning you lose high-value
> capacity or sacrifice contiguity (less huge page support).

One of the other reasons is to prevent long term pinning. But I think that's a
conversation that warrants a whole separate thread.

> This entire problem can evaporate if you can just use ZONE_NORMAL.
> 
> There are a lot of benefits to just re-using the buddy like this.
> 
> Zones are the wrong abstraction and cause more problems.
> 
> > >   free_folio           - mirrors ZONE_DEVICE's
> > >   folio_split          - mirrors ZONE_DEVICE's
> > >   migrate_to           - ... same as ZONE_DEVICE
> > >   handle_fault         - mirrors the ZONE_DEVICE ...
> > >   memory_failure       - parallels memory_failure_dev_pagemap(),
> > 
> > One does not have to squint too hard to see that the above is not so
> > different from what ZONE_DEVICE provides today via dev_pagemap_ops(). So I
> > think it would be worth outlining why the existing ZONE_DEVICE mechanism
> > can't be extended to provide these kinds of services.
> > 
> > This seems to add a bunch of code just to use NODE_DATA instead of
> > page->pgmap, without really explaining why just extending dev_pagemap_ops
> > wouldn't work. The obvious reason is that if you want to support things
> > like reclaim, compaction, etc. these pages need to be on the LRU, which is
> > a little bit hard when that field is also used by the pgmap pointer for
> > ZONE_DEVICE pages.
> > 
> 
> You don't have to squint because it was deliberate :]

Nice.

> The callback similarity is the feature - they're the same logical
> operations.  The difference is the direction of the defaults.
> 
> Extending ZONE_DEVICE into these areas requires the same set of hooks,
> plus distinguishing "old ZONE_DEVICE" from "new ZONE_DEVICE".
> 
> Where there are new injection sites, it's because ZONE_DEVICE opts
> out of ever touching that code in some other silently implied way.

Yeah, I hate that aspect of ZONE_DEVICE. There are far too many places where we
"prove" you can't have a ZONE_DEVICE page because of ad-hoc "reasons". Usually
they take the form of "it's not on the LRU", or "it's not an anonymous page and
this isn't DAX", etc.

> For example, reclaim/compaction doesn't run because ZONE_DEVICE doesn't
> add to managed_pages (among other reasons).

And people can't even agree on the reasons. I would argue the primary reason is
reclaim/compaction doesn't run because it can't even find the pages due to them
not being on the LRU. But everyone is equally correct.

> You'd have to go figure out how to hack those things into ZONE_DEVICE 
> *and then* opt every *other* ZONE_DEVICE mode *back out*.
> 
> So you still end up with something like this anyway:
> 
> static inline bool folio_managed_handle_fault(struct folio *folio,
>                                               struct vm_fault *vmf,
>                                               enum pgtable_level level,
>                                               vm_fault_t *ret)
> {
>         /* Zone device pages use swap entries; handled in do_swap_page */
>         if (folio_is_zone_device(folio))
>                 return false;
> 
>         if (folio_is_private_node(folio))
>               ...
>         return false;
> }
> 
> 
> > example page_ext could be used.  Or I hear struct page may go away in place
> > of folios any day now, so maybe that gives us space for both :-)
> > 
> 
> If NUMA is the interface we want, then NODE_DATA is the right direction
> regardless of struct page's future or what zone it lives in.
> 
> There's no reason to keep per-page pgmap w/ device-to-node mappings.

In reality I suspect that's already the case today. I'm not sure we need
per-page pgmap.

> You can have one driver manage multiple devices with the same numa node
> if it uses the same owner context (PFN already differentiates devices).
> 
> The existing code allows for this.
> 
> > The above also looks pretty similar to the existing ZONE_DEVICE methods for
> > doing this which is another reason to argue for just building up the
> > feature set of the existing boondoggle rather than adding another
> > thingymebob.
> >
> > It seems the key thing we are looking for is:
> > 
> > 1) A userspace API to allocate/manage device memory (ie. move_pages(),
> > mbind(), etc.)
> > 
> > 2) Allowing reclaim/LRU list processing of device memory.
> > 
> > From my perspective both of these are interesting and I look forward to the
> > discussion (hopefully I can make it to LSFMM). Mostly I'm interested in the
> > implementation as this does on the surface seem to sprinkle around and
> > duplicate a lot of hooks similar to what ZONE_DEVICE already provides.
> > 
> 
> On (1): ZONE_DEVICE NUMA UAPI is harder than it looks from the surface

Ok, I will admit I've only been hovering on the surface so I need to give this
some more thought. Everything you've written below makes sense and is
definitely food for thought. Thanks.

 - Alistair

> Much of the kernel mm/ infrastructure is written on top of the buddy and
> expects N_MEMORY to be the sole arbiter of "Where to Acquire Pages".
> 
> Mempolicy depends on:
>    - Buddy support or a new alloc hook around the buddy
> 
>    - Migration support (mbind() after allocation migrates)
>      - Migration also deeply assumes buddy and LRU support
> 
>    - Changing validations on node states
>      - mempolicy checks N_MEMORY membership, so you have to hack
>        N_MEMORY onto ZONE_DEVICE
>        (or teach it about a new node state... N_MEMORY_PRIVATE)
> 
> 
> Getting mempolicy to work with N_MEMORY_PRIVATE amounts to adding 2
> lines of code in vma_alloc_folio_noprof:
> 
> struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order,
>                                      struct vm_area_struct *vma,
>                                      unsigned long addr)
> {
>         if (pol->flags & MPOL_F_PRIVATE)
>                 gfp |= __GFP_PRIVATE;
> 
>         folio = folio_alloc_mpol_noprof(gfp, order, pol, ilx, numa_node_id());
>         /* Woo! I faulted a DEVICE PAGE! */
> }
> 
> But this requires the pages to be managed by the buddy.
> 
> The rest of the mempolicy support is around keeping sane nodemasks when
> things like cpuset.mems rebinds occur and validating you don't end up
> with private nodes that don't support mempolicy in your nodemask.
> 
> You have to do all of this anyway, but with the added bonus of fighting
> with the overloaded nature of ZONE_DEVICE at every step.
> 
> ==========
> 
> On (2): Assume you solve LRU. 
> 
> Zone Device has no free lists, managed_pages, or watermarks.
> 
> kswapd can't run, compaction has no targets, vmscan's pressure model
> doesn't function.  These all come for free when the pages are
> buddy-managed on a real zone.  Why re-invent the wheel?
> 
> ==========
> 
> So you really have two options here:
> 
> a) Put pages in the buddy, or
> 
> b) Add pgmap->device_alloc() callbacks at every allocation site that
>    could target a node:
>      - vma_alloc_folio
>      - alloc_migration_target
>      - alloc_demote_folio
>      - alloc_pages_node
>      - alloc_contig_pages
>      - list goes on
> 
> Or more likely - hooking get_page_from_freelist.  Which at that
> point... just use the buddy?  You're already deep in the hot path.
> 
> > 
> > For basic allocation I agree this is the case. But there's no reason some
> > device allocator library couldn't be written. Or in fact as pointed out
> > above reuse the already existing one in drm_buddy.c.  So would be
> > interested to hear arguments for why allocation has to be done by the mm
> > allocator and/or why an allocation library wouldn't work here given DRM
> > already has them.
> > 
> 
> Using the buddy underpins the rest of mm/ services we want to re-use.
> 
> That's basically it.  Otherwise you have to inject hooks into every
> surface that touches the buddy...
> 
> ... or in the buddy (get_page_from_freelist), at which point why not
> just use the buddy?
> 
> ~Gregory
