On 1/14/26 10:36 PM, Mathieu Desnoyers wrote:
Use the precise, albeit slower, RSS counter sums for the OOM
killer task selection and console dumps. The approximated value is
too imprecise on large many-core systems.

The following RSS tracking issues were noted by Sweet Tea Dorminy [1],
which lead to picking the wrong tasks as OOM kill targets:

   Recently, several internal services had an RSS usage regression as part of a
   kernel upgrade. Previously, they were on a pre-6.2 kernel and were able to
   read RSS statistics in a backup watchdog process to monitor and decide if
   they'd overrun their memory budget. Now, however, a representative service
   with five threads, expected to use about a hundred MB of memory, on a 250-cpu
   machine had memory usage tens of megabytes different from the expected amount
   -- this constituted a significant percentage of inaccuracy, causing the
   watchdog to act.

   This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
   into percpu_counter") [1].  Previously, the memory error was bounded by
   64*nr_threads pages, a very livable megabyte. Now, however, as a result of
   scheduler decisions moving the threads around the CPUs, the memory error could
   be as large as a gigabyte.

   This is a really tremendous inaccuracy for any few-threaded program on a
   large machine and impedes monitoring significantly. These stat counters are
   also used to make OOM killing decisions, so this additional inaccuracy could
   make a big difference in OOM situations -- either resulting in the wrong
   process being killed, or in less memory being returned from an OOM-kill than
   expected.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
    increased system time in make test workloads [1]. Moreover, the
    inaccuracy grows as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
    error is high with systems that have lots of NUMA nodes (32 times
    the number of NUMA nodes).
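
To make the error bounds above concrete, here is a rough arithmetic sketch in
C. It is illustrative only, not kernel code. The per-thread bound
(64*nr_threads pages) and the per-NUMA-node bound (32 pages per node) are taken
from the text above. The per-CPU bound assumes a percpu_counter batch of
max(32, 2*nr_cpus) with each CPU holding back at most one batch, and the
conversion to KiB assumes 4 KiB pages; both are assumptions, and all function
names are hypothetical.

```c
#include <assert.h>

/* Per-thread caching: bound quoted above as 64 pages per thread. */
unsigned long per_thread_error_pages(unsigned long nr_threads)
{
	return 64UL * nr_threads;
}

/*
 * Per-CPU counters: ASSUMES batch = max(32, 2 * nr_cpus) and that each
 * CPU may hold back up to one batch, hence the O(n^2) growth.
 */
unsigned long per_cpu_error_pages(unsigned long nr_cpus)
{
	unsigned long batch;

	assert(nr_cpus > 0);
	batch = 2 * nr_cpus > 32 ? 2 * nr_cpus : 32;
	return batch * nr_cpus;
}

/* Per-NUMA-node counters: bound quoted above as 32 pages per node. */
unsigned long per_node_error_pages(unsigned long nr_nodes)
{
	return 32UL * nr_nodes;
}

/* ASSUMES 4 KiB pages. */
unsigned long pages_to_kib(unsigned long pages)
{
	return pages * 4;
}
```

Under these assumptions, five threads give 320 pages (1280 KiB) for the
per-thread scheme, while 250 CPUs give 125000 pages (about 488 MiB) for a
single per-CPU counter.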

Commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue for
users") introduced get_mm_counter_sum() to provide precise memory status
in some proc files.

The simple fix proposed here is to do the precise per-cpu counters sum
every time a counter value needs to be read. This applies to the OOM
killer task selection and the OOM task console dumps (printk).

This change increases the latency of the OOM killer in exchange for a
more precise OOM target task selection. Effectively, the OOM killer
iterates over all tasks and, for each relevant page type, the precise
sum iterates over all possible CPUs.
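
The trade-off can be illustrated with a small user-space model of a batched
per-CPU counter. This is a sketch under stated assumptions, not the kernel's
percpu_counter implementation: the toy_* names, the fixed TOY_NR_CPUS, and the
cost formula are all hypothetical. It only shows why a cheap read can lag
behind the true value while a full sum cannot, and why the full sum costs one
pass over every CPU per counter per task.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 256	/* hypothetical CPU count for the model */

struct toy_counter {
	long count;			/* shared value, cheap to read */
	long pcpu[TOY_NR_CPUS];		/* per-CPU deltas not yet folded in */
	long batch;			/* fold threshold */
};

void toy_init(struct toy_counter *c, long batch)
{
	assert(batch > 0);
	c->count = 0;
	c->batch = batch;
	for (size_t i = 0; i < TOY_NR_CPUS; i++)
		c->pcpu[i] = 0;
}

/* Fast path: only fold into the shared value once the delta hits batch. */
void toy_add(struct toy_counter *c, size_t cpu, long amount)
{
	long v;

	assert(cpu < TOY_NR_CPUS);
	v = c->pcpu[cpu] + amount;
	if (v >= c->batch || v <= -c->batch) {
		c->count += v;
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = v;
	}
}

/* Approximate read: O(1), ignores pending per-CPU deltas. */
long toy_read(const struct toy_counter *c)
{
	return c->count;
}

/* Precise sum: O(nr_cpus), includes every pending delta. */
long toy_sum(const struct toy_counter *c)
{
	long sum = c->count;

	for (size_t i = 0; i < TOY_NR_CPUS; i++)
		sum += c->pcpu[i];
	return sum;
}

/* Per-CPU slots visited when every task's counters are summed precisely. */
unsigned long long toy_oom_scan_cost(unsigned long long nr_tasks,
				     unsigned long long nr_counters,
				     unsigned long long nr_cpus)
{
	return nr_tasks * nr_counters * nr_cpus;
}
```

In this model, adding 511 pages on each of 256 CPUs with a batch of 512 leaves
toy_read() at 0 while toy_sum() reports all 130816 pages. The cost side grows
with the product of tasks, counters, and CPUs, which is consistent with the
latency increase being most visible at large process counts.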

As a reference, here is the execution time of the OOM killer
before/after the change:

AMD EPYC 9654 96-Core (2 sockets)
Within a KVM guest configured with 256 logical CPUs.

                                  |  before  |  after   |
----------------------------------|----------|----------|
nr_processes=40                   |   0.3 ms |   0.5 ms |
nr_processes=10000                |   3.0 ms |  80.0 ms |

Suggested-by: Michal Hocko <[email protected]>
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Link: https://lore.kernel.org/lkml/[email protected]/ # [1]
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Martin Liu <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: [email protected]
Cc: Shakeel Butt <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Sweet Tea Dorminy <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: "Liam R . Howlett" <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Yu Zhao <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Aboorva Devarajan <[email protected]>
---
This patch replaces v1. It's aimed at mm-new.

Changes since v1:
- Only change the OOM killer RSS values from approximated to precise
   sums. Do not change other users of RSS values.
---

LGTM.
Reviewed-by: Baolin Wang <[email protected]>
