On Wed, Dec 03, 2025 at 03:36:33PM +1100, Balbir Singh wrote:
> > - I discussed in my note to David that this is probably the right
> > way to go about doing it. I think N_MEMORY can still be set, if
> > a new global-default-node policy is created.
> >
>
> I still think N_MEMORY as a flag should mean something different from
> N_SPM_NODE_MEMORY because their characteristics are different
>
... snip ... (I agree, see later)
> > - Instead, I can see either per-component policies (reclaim->nodes)
> > or a global policy that covers all of those components (similar to
> > my sysram_nodes). Drivers would then be responsible to register
> > their hotplugged memory nodes with those components accordingly.
> >
>
> To me node zonelists provide the right abstraction of where to allocate from
> and how to fallback as needed. I'll read your patches to figure out how your
> approach is different. I wanted the isolation at allocation time
>
... snip ... (I agree, see later)
>
> Yes, we should look at the pros and cons. To be honest, I'd wouldn't be
> opposed to having kswapd and reclaim look different for these nodes, it
> would also mean that we'd need pagecache hooks if we want page cache on
> these nodes. Everything else, including move_pages() should just work.
>
Basically my series does (roughly) the same as yours, but adds the
cpusets controls and a GFP flag. The MHP extension should ultimately
be converted to N_SPM_NODE_MEMORY (or whatever we decide to name it).

After some more time to think, I think we want all of it.

- N_SPM_NODE_MEMORY (or whatever we call it) handles filtering out
  SPM at allocation time by default and protects all current users
  of N_MEMORY from exposure to SPM.

- cpusets controls allow userland isolation control and a default sysram
  mask (I think cpusets.sysram_nodes doesn't even need to be exposed via
  sysfs to be honest). The cpusets fix is needed because
  task->mems_allowed is used as the default nodemask on systems using
  cgroups/cpusets.

- GFP_SP_NODE protects against someone doing something like:

      get_page_from_freelist(..., node_states[N_POSSIBLE])

  or

      numactl --interleave --all ./my_program

  while providing a way to punch an explicit hole in the isolation
  (GFP_SP_NODE means "Use N_SPM_NODE_MEMORY instead of N_MEMORY").

  This could be argued against so long as we restrict mempolicy.c
  to N_MEMORY nodes (to avoid `--interleave --all` issues), but this
  limitation may not be preferable.

My concern is breaking existing userland software that happens
to run on a system with SPM, but you can probably imagine many more
bad scenarios.
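
To make the default-filter vs. explicit-hole idea concrete, here is a
rough userspace sketch. It is not kernel code and not what either series
implements: the integer nodemask type, the node layout (nodes 0-1 sysram,
nodes 2-3 SPM), the GFP_SP_NODE value, and the helpers effective_nodes()
and interleave_all_nodes() are all made up for illustration. The cpusets
layer is left out entirely.

```c
#include <stdint.h>

/* Toy nodemask, one bit per node (the kernel uses nodemask_t bitmaps). */
typedef uint32_t toy_nodemask_t;

/* Hypothetical layout: nodes 0-1 are ordinary RAM, nodes 2-3 are SPM. */
static const toy_nodemask_t n_possible        = 0xF;
static const toy_nodemask_t n_memory          = 0x3; /* models N_MEMORY */
static const toy_nodemask_t n_spm_node_memory = 0xC; /* models N_SPM_NODE_MEMORY */

/* Hypothetical flag value, chosen only for this sketch. */
#define GFP_SP_NODE 0x1u

/*
 * Clamp the caller's requested nodes to the set it is allowed to use.
 * Without GFP_SP_NODE the request is intersected with N_MEMORY, so SPM
 * nodes are filtered out by default. With GFP_SP_NODE the request is
 * intersected with N_SPM_NODE_MEMORY instead.
 */
toy_nodemask_t effective_nodes(toy_nodemask_t requested, unsigned int gfp)
{
	toy_nodemask_t base = (gfp & GFP_SP_NODE) ? n_spm_node_memory
						  : n_memory;

	return requested & base;
}

/*
 * Models a caller that asks for every possible node without opting in,
 * like the node_states[N_POSSIBLE] or "--interleave --all" cases.
 */
toy_nodemask_t interleave_all_nodes(void)
{
	return effective_nodes(n_possible, 0);
}
```

In this model interleave_all_nodes() returns 0x3, so the "all nodes"
caller never lands on SPM, while effective_nodes(0xF, GFP_SP_NODE)
returns 0xC for a caller that explicitly opted in. A request for only
an SPM node without the flag comes back empty, which is the isolation
we want by default.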
~Gregory