On 2026-01-13 04:24, Michal Hocko wrote:
[...]
Would you be OK with introducing changes in the following order?
1) Fix the OOM killer inaccuracy by using counter sum (iteration on all
cpu counters) in task selection. This may slow down the oom killer,
but would at least fix its current inaccuracy issues. This could be
backported to stable kernels.
2) Introduce the hierarchical percpu counters on top, as an OOM killer
task selection performance optimization (reduce latency of oom kill).
This way, (2) becomes purely a performance optimization, so it's easy
to bisect and revert if it causes issues.
Yes, this makes more sense.
I agree that bringing a fix along with a performance optimization within
a single commit makes it hard to backport to stable, and tricky to
revert if it causes problems.
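
To illustrate why step (1) trades speed for accuracy, here is a minimal
userspace model of a batched per-CPU counter. It is a sketch only, not
the kernel's percpu_counter code: NR_CPUS, BATCH, struct model_counter
and the model_*() helpers are all hypothetical names made up for this
example. The approximate read returns only the global value, while the
precise sum also walks every per-CPU delta.

```c
#define NR_CPUS 4   /* hypothetical CPU count for this model */
#define BATCH   32  /* hypothetical per-CPU batch threshold */

/* Userspace model of a batched per-CPU counter (not the kernel struct). */
struct model_counter {
	long global;          /* updated only when a local delta reaches BATCH */
	long local[NR_CPUS];  /* per-CPU deltas not yet folded into global */
};

/* Add v on behalf of @cpu; fold into global once |delta| reaches BATCH. */
static void model_add(struct model_counter *c, int cpu, long v)
{
	long n = c->local[cpu] + v;

	if (n >= BATCH || n <= -BATCH) {
		c->global += n;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = n;
	}
}

/* Fast path: O(1), but can be off by almost NR_CPUS * BATCH. */
static long model_read_approx(const struct model_counter *c)
{
	return c->global;
}

/* Slow path: O(NR_CPUS), iterates over all per-CPU deltas. */
static long model_sum_precise(const struct model_counter *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

In this model, the worst-case gap between the two reads grows with the
number of CPUs, which is the inaccuracy that using the sum in task
selection would remove, at the cost of the per-CPU iteration.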
As for finding other users of the hpcc, I have ideas, but not so much
time available to try them out, as I'm pretty much doing this in my
spare time.
I do understand this constraint and the motivation to have the OOM
situation addressed as a priority. I am pretty sure that if you see
issues in the OOM path then other consumers of get_mm_counter would be
affected as well. Namely /proc/<pid>/stat.
Indeed /proc/<pid>/stat (implemented in fs/proc/array.c:do_task_stat())
uses get_mm_rss() which currently exports the approximated value to
userspace.
There might be others but I can imagine
that some of them are more performance than precision sensitive.
Agreed.
All that being said, it seems that we need slow-and-precise and
fast-approximate interfaces to have an incremental path for other users
as well. Looking at patch 1 it seems there are interfaces available for
that. I think it would be great to call those out explicitly in the
high-level doc to give some guidance on what to use when, with what kind
of expectations.
I figured I'd first focus on the OOM killer's internals before tackling
the userspace ABI aspect of the problem, but since you're bringing it
up, here is what I have in mind, more or less:
- Introduce new proc files, e.g.

    /proc/<pid>/rss/approximate
    /proc/<pid>/rss/precise

  Where the "approximate" file would export the following lines for each
  page type (MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMPAGES,
  allowing future additions):

    <page type> <approximate> <precise_sum_min> <precise_sum_max>

  And "precise" would export lines for each page type:

    <page type> <precise_sum>

The key thing here is to have different files to query approximated
vs precise values, so we don't have the overhead of the precise sum
when all we need is an approximation.
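
As a rough sketch of how userspace might consume one line of the
proposed "approximate" file, assuming the four fields are separated by
whitespace and the page type is a single token of fewer than 32
characters (both are assumptions; the ABI does not exist yet). The
struct and function names are hypothetical.

```c
#include <stdio.h>

/* One parsed line of the hypothetical /proc/<pid>/rss/approximate file. */
struct rss_approx_line {
	char type[32];          /* page type token, e.g. "MM_ANONPAGES" */
	unsigned long approx;   /* <approximate> */
	unsigned long sum_min;  /* <precise_sum_min> */
	unsigned long sum_max;  /* <precise_sum_max> */
};

/*
 * Parse "<page type> <approximate> <precise_sum_min> <precise_sum_max>".
 * Returns 0 on success, -1 if a field is missing or min > max.
 */
static int parse_rss_approx_line(const char *line,
				 struct rss_approx_line *out)
{
	if (sscanf(line, "%31s %lu %lu %lu", out->type, &out->approx,
		   &out->sum_min, &out->sum_max) != 4)
		return -1;
	if (out->sum_min > out->sum_max)
		return -1;
	return 0;
}
```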
This would expose all the bits and pieces needed to allow userspace to
implement something similar to the 2-pass algorithm I'm proposing for
the OOM killer, but tweaked for other use-cases.
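
One plausible reading of such a 2-pass selection, sketched in userspace
C. This is an illustration under stated assumptions, not the algorithm
from the patch series: it assumes each task's precise sum always lies
within [sum_min, sum_max], and the "precise" field stands in for an
expensive precise read. All names are hypothetical.

```c
#include <stddef.h>

struct task_rss {
	unsigned long sum_min;  /* lower bound on the precise sum */
	unsigned long sum_max;  /* upper bound on the precise sum */
	unsigned long precise;  /* stand-in for a costly precise read */
};

/*
 * Pass 1: find the largest lower bound using only the cheap bounds.
 * Pass 2: read the precise value only for tasks whose upper bound can
 *         still reach that lower bound, and pick the largest one.
 * Returns the index of the selected task, or -1 if n == 0.
 * *precise_reads counts how many precise reads were needed.
 */
static long pick_largest_two_pass(const struct task_rss *t, size_t n,
				  size_t *precise_reads)
{
	unsigned long best_min = 0, best = 0;
	long best_idx = -1;

	*precise_reads = 0;
	for (size_t i = 0; i < n; i++)
		if (t[i].sum_min > best_min)
			best_min = t[i].sum_min;

	for (size_t i = 0; i < n; i++) {
		if (t[i].sum_max < best_min)
			continue;  /* cannot be the largest, skip the read */
		(*precise_reads)++;
		if (best_idx < 0 || t[i].precise > best) {
			best = t[i].precise;
			best_idx = (long)i;
		}
	}
	return best_idx;
}
```

The point of splitting the files is visible here: pass 1 touches only
the approximate data, and the precise sum is paid for only on the few
candidates that survive pruning.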
This proposed ABI is purely hypothetical at this stage. Please let me
know if you have something different in mind.
When you mention "high-level doc", which document do you have in mind?
Something related to lib/percpu_counter_tree.c or to the /proc ABI ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com