On 3/9/26 4:43 PM, Shakeel Butt wrote:
On Fri, Mar 06, 2026 at 08:55:20PM -0800, JP Kobryn (Meta) wrote:
When investigating pressure on a NUMA node, there is no straightforward way
to determine which policies are driving allocations to it.

Add per-policy page allocation counters as new node stat items. These
counters track allocations to nodes and also whether the allocations were
intentional or fallbacks.

The new stats follow the existing numa hit/miss/foreign style and have the
following meanings:

   hit
     - for BIND and PREFERRED_MANY, allocation succeeded on node in nodemask
     - for other policies, allocation succeeded on intended node
     - counted on the node of the allocation
   miss
     - allocation intended for other node, but happened on this one
     - counted on this node (the fallback node where the allocation landed)
   foreign
     - allocation intended on this node, but happened on other node
     - counted on this node

Counters are exposed per-memcg, per-node in memory.numa_stat and globally
in /proc/vmstat.

Signed-off-by: JP Kobryn (Meta) <[email protected]>

[...]

+
+       rcu_read_lock();
+       memcg = mem_cgroup_from_task(current);
+
+       if (is_hit) {
+               lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(actual_nid));
+               mod_lruvec_state(lruvec, hit_idx, nr_pages);
+       } else {
+               /* account for miss on the fallback node */
+               lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(actual_nid));
+               mod_lruvec_state(lruvec, hit_idx + 1, nr_pages);
+
+               /* account for foreign on the intended node */
+               lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(intended_nid));
+               mod_lruvec_state(lruvec, hit_idx + 2, nr_pages);
+       }
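
For illustration only, here is a minimal user-space sketch of the hit/miss/foreign accounting in the hunk above. It assumes a single intended node per allocation, so it does not model the nodemask check used for BIND and PREFERRED_MANY. The names policy_node_stats, account_alloc and MAX_NODES are hypothetical and do not come from the patch.

```c
#include <assert.h>

#define MAX_NODES 4	/* illustrative node count, not a kernel constant */

/* Hypothetical per-node counters for a single mempolicy. */
struct policy_node_stats {
	unsigned long hit;
	unsigned long miss;
	unsigned long foreign;
};

/*
 * Model of the accounting in the hunk above:
 *  - hit:     counted on the node that served the allocation
 *  - miss:    counted on the fallback node that served the allocation
 *  - foreign: counted on the node that was intended but not used
 */
static void account_alloc(struct policy_node_stats *stats, int intended_nid,
			  int actual_nid, unsigned long nr_pages)
{
	assert(intended_nid >= 0 && intended_nid < MAX_NODES);
	assert(actual_nid >= 0 && actual_nid < MAX_NODES);

	if (intended_nid == actual_nid) {
		stats[actual_nid].hit += nr_pages;
	} else {
		stats[actual_nid].miss += nr_pages;
		stats[intended_nid].foreign += nr_pages;
	}
}
```

With this model, an allocation of 2 pages intended for node 0 that lands on node 1 adds 2 to node 1's miss counter and 2 to node 0's foreign counter, and leaves both hit counters unchanged.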

These look like monotonically increasing metrics, and I think you care
about their rate of change rather than their absolute value. Is there any
reason this cannot be achieved with a combination of tracepoints and BPF?

We have the per-node reclaim stats (pg{steal,scan,refill}) in
nodeN/vmstat and memory.numa_stat now. The new stats in this patch would
be collected from the same source. They were meant to be used together,
so it seemed like a reasonable location. I think the advantage over
tracepoints is that we get the observability from the start, and it would
be simple to extend existing programs that already read stats from the
cgroup dir files.
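
As a rough sketch of what extending such a program could look like, the helpers below parse one memory.numa_stat-style line ("<key> N0=<val> N1=<val> ...") and compute the per-node delta between two samples, which is the rate-of-change view discussed above. The stat key is passed in by the caller because the excerpt does not show the final counter names; parse_numa_stat_line, per_node_delta and MAX_NODES are hypothetical names, and the code assumes the counters only increase between samples.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 8	/* illustrative upper bound on node ids */

/*
 * Parse one line of the form "<key> N0=<val> N1=<val> ..." into vals[],
 * indexed by node id. Returns the number of entries stored, or -1 if the
 * line does not start with the requested key.
 */
static int parse_numa_stat_line(const char *line, const char *key,
				unsigned long long vals[MAX_NODES])
{
	size_t klen = strlen(key);
	const char *p;
	int stored = 0;

	if (strncmp(line, key, klen) != 0 || line[klen] != ' ')
		return -1;

	memset(vals, 0, MAX_NODES * sizeof(vals[0]));
	p = line + klen;

	while (stored < MAX_NODES) {
		unsigned int nid;
		unsigned long long val;
		int consumed = 0;

		/* %n records how many characters this "N<id>=<val>" pair used */
		if (sscanf(p, " N%u=%llu%n", &nid, &val, &consumed) != 2)
			break;
		if (nid < MAX_NODES) {
			vals[nid] = val;
			stored++;
		}
		p += consumed;
	}
	return stored;
}

/* Per-node change between two samples of a monotonically increasing counter. */
static void per_node_delta(const unsigned long long *prev,
			   const unsigned long long *cur,
			   unsigned long long *delta, int nr_nodes)
{
	assert(nr_nodes >= 0 && nr_nodes <= MAX_NODES);

	for (int i = 0; i < nr_nodes; i++)
		delta[i] = cur[i] - prev[i];
}
```

A caller would read the file twice at a known interval, parse the line for the counter of interest each time, and divide each delta by the interval to get a per-node rate.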
