On 2/16/26 1:07 PM, Michal Hocko wrote:
On Mon 16-02-26 09:50:26, JP Kobryn (Meta) wrote:
On 2/16/26 12:26 AM, Michal Hocko wrote:
On Thu 12-02-26 13:22:56, JP Kobryn wrote:
On 2/11/26 11:29 PM, Michal Hocko wrote:
On Wed 11-02-26 20:51:08, JP Kobryn wrote:
It would be useful to see a breakdown of allocations to understand which
NUMA policies are driving them. For example, when investigating memory
pressure, having policy-specific counts could show that allocations were
bound to the affected node (via MPOL_BIND).

Add per-policy page allocation counters as new node stat items. These
counters help correlate a mempolicy with the pressure observed on a
given node.
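
To make "new node stat items" concrete, the additions would be along the
lines of the sketch below. The names here are made up for illustration
only and need not match what the patch actually defines:

        /*
         * Illustration only: one counter per mempolicy, added alongside
         * the existing node stat items. Placeholder names, not the
         * patch's.
         */
        enum {
                NR_MPOL_BIND_ALLOC,
                NR_MPOL_INTERLEAVE_ALLOC,
                NR_MPOL_WEIGHTED_INTERLEAVE_ALLOC,
                NR_MPOL_PREFERRED_ALLOC,
                NR_MPOL_LOCAL_ALLOC,
        };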

Could you be more specific about how exactly you plan to use those
counters?

Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once
we identify the affected node(s), the new mpol counters (this patch)
allow us to correlate the pressure with the mempolicy driving it.

I would appreciate somewhat more specificity. You are adding counters
that are not really easy to drop once they are in. Sure, we have
precedent for dropping some counters in the past, so this is not as hard
a commitment as the usual userspace APIs, but still...

How exactly do you attribute mempolicy allocations to specific nodes?
While MPOL_BIND is quite straightforward, others are less so.

The design does account for this regardless of the policy. In the call
to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is
attributed to the node where the page actually landed.
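
Roughly, the accounting site has the shape sketched below; it is not the
literal patch, and the item names are the same placeholders as above. The
key point is that the stat is charged to the node the page actually came
from, not the node the policy asked for:

        /*
         * Sketch only: charge the per-policy counter to page_pgdat(page),
         * i.e. the node the page landed on.
         */
        static void mpol_account_alloc(struct page *page, unsigned short mode)
        {
                enum node_stat_item item;

                switch (mode) {
                case MPOL_BIND:
                        item = NR_MPOL_BIND_ALLOC;              /* placeholder */
                        break;
                case MPOL_INTERLEAVE:
                        item = NR_MPOL_INTERLEAVE_ALLOC;        /* placeholder */
                        break;
                default:
                        return;
                }

                __mod_node_page_state(page_pgdat(page), item, 1);
        }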

That much is clear[*]. The consumer side of things is not really clear to
me. How do you know which policy, or which part of that policy's
nodemask, is the source of the memory pressure on a particular node? In
other words, how useful is the data beyond a single-node mempolicy (i.e.
MPOL_BIND)?

Other than the bind policy, having the interleave (and weighted
interleave) stats would let us see the effective distribution of the
policy, so pressure could be linked back to a user-configured weight
scheme. For example, with weighted interleave weights of node0=3 and
node1=1, roughly 75% of that policy's allocations should show up in
node0's counter; a split that deviates from that, or pressure
concentrated on node0, points back at the weights. It would also help
confirm that the observed distribution matches what was configured.

You brought up the nodemask; with the preferred policy, I think this is
another good case for the counters. Once we know the node(s) under
pressure and then see significant preferred allocations accounted there,
we can search the numa_maps entries containing "prefer:<node>" to find
the tasks targeting the affected nodes.

I mentioned this on another thread in this series, but I'll include it
here as well and expand on it a bit. For any given policy, the workflow
would be:
1) Pressure/OOMs are reported while system-wide memory is free.
2) Check the per-node pgscan/pgsteal stats (provided by patch 2) to
narrow down the node(s) under pressure. These become available in
/sys/devices/system/node/nodeN/vmstat.
3) Check the per-policy allocation counters (this patch) on that node to
find which policy was driving it. Same readout at nodeN/vmstat.
4) Use /proc/*/numa_maps to identify the tasks using that policy.
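
Purely as an illustration of steps 2-4 (including the "prefer:<node>"
search mentioned earlier), a minimal user-space readout could look like
the program below. The "mpol_" counter prefix is a guess; the real names
are whatever the series ends up exporting in nodeN/vmstat, and node0
stands in for whichever node turns out to be under pressure:

        #include <glob.h>
        #include <stdio.h>
        #include <string.h>

        /* Illustration of the workflow only; not part of the series. */
        int main(void)
        {
                char line[1024];
                glob_t g;
                size_t i;
                FILE *f;

                /* steps 2/3: per-node reclaim and per-policy counters */
                f = fopen("/sys/devices/system/node/node0/vmstat", "r");
                if (f) {
                        while (fgets(line, sizeof(line), f)) {
                                if (!strncmp(line, "pgscan", 6) ||
                                    !strncmp(line, "pgsteal", 7) ||
                                    !strncmp(line, "mpol_", 5)) /* guessed prefix */
                                        fputs(line, stdout);
                        }
                        fclose(f);
                }

                /* step 4: tasks whose mappings bind/prefer the node under pressure */
                if (glob("/proc/[0-9]*/numa_maps", 0, NULL, &g) == 0) {
                        for (i = 0; i < g.gl_pathc; i++) {
                                f = fopen(g.gl_pathv[i], "r");
                                if (!f)
                                        continue;
                                while (fgets(line, sizeof(line), f)) {
                                        if (strstr(line, "prefer:0") ||
                                            strstr(line, "bind:0")) {
                                                printf("%s\n", g.gl_pathv[i]);
                                                break;
                                        }
                                }
                                fclose(f);
                        }
                        globfree(&g);
                }
                return 0;
        }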


[*] btw. I believe you misaccount MPOL_LOCAL because you attribute the
allocation to the target node even when it comes from a node that is
remote from the "local" POV.

It's a good point. The accounting that results from fallback cases
shouldn't detract from an investigation, though. We're interested in the
node(s) under pressure, and the relatively few fallback allocations land
on nodes that are not under pressure, so they can be viewed as
acceptable noise.
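
If that noise ever needed to be quantified, telling a true local hit
apart from a fallback would be cheap at the accounting site. Just a
sketch of the idea, not something this series does:

        /*
         * Sketch only: for MPOL_LOCAL, the allocation fell back if the
         * page did not come from the node the allocating task is
         * running on.
         */
        static bool mpol_local_fell_back(struct page *page)
        {
                return page_to_nid(page) != numa_node_id();
        }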
