Hi, sorry to jump in this late but the timing of previous versions didn't really work well for me.
On Sun 11-01-26 14:49:57, Mathieu Desnoyers wrote:
[...]
> Here is a (possibly incomplete) list of the prior approaches that were
> used or proposed, along with their downside:
>
> 1) Per-thread rss tracking: large error on many-thread processes.
>
> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
>    increased system time in make test workloads [1]. Moreover, the
>    inaccuracy increases with O(n^2) with the number of CPUs.
>
> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
>    error is high with systems that have lots of NUMA nodes (32 times
>    the number of NUMA nodes).
>
> The approach proposed here is to replace this by the hierarchical
> per-cpu counters, which bounds the inaccuracy based on the system
> topology with O(N*logN).

The concept of hierarchical pcp counters is interesting and I am
definitely not opposed if there are more users that would benefit.

From the OOM POV, IIUC the primary problem is that get_mm_counter
(percpu_counter_read_positive) is too imprecise on systems where the
task moves across a large number of CPUs. In the list of alternative
solutions I do not see percpu_counter_sum_positive mentioned.
oom_badness() is a really slow path, and taking the slow path there to
calculate a much more precise value seems acceptable. Have you
considered that option?
-- 
Michal Hocko
SUSE Labs
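
To make the read-versus-sum distinction above concrete, here is a minimal
user-space sketch of a batched per-CPU counter. It is not kernel code: the
model_* names, the CPU count and the batch size are all hypothetical and
only loosely mirror what percpu_counter_read_positive and
percpu_counter_sum_positive do. The cheap read looks only at the shared
value, while the precise sum also walks every per-CPU delta.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 64	/* hypothetical number of CPUs */
#define MODEL_BATCH   32	/* hypothetical per-CPU batch size */

/* User-space model only; the real percpu_counter has locking, hotplug, etc. */
struct model_counter {
	int64_t count;			/* shared, approximate value */
	int32_t pcpu[MODEL_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Fast path: accumulate locally, fold into the shared value at the batch. */
static void model_add(struct model_counter *c, int cpu, int32_t amount)
{
	int64_t delta;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	delta = (int64_t)c->pcpu[cpu] + amount;
	if (delta >= MODEL_BATCH || delta <= -MODEL_BATCH) {
		c->count += delta;
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = (int32_t)delta;
	}
}

/* Cheap read: shared value only, clamped at zero. */
static int64_t model_read_positive(const struct model_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Slow, precise read: shared value plus every per-CPU delta. */
static int64_t model_sum_positive(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum > 0 ? sum : 0;
}

/*
 * A task that visits every CPU and charges just under one batch on each
 * leaves the cheap read at zero; return how far it is from the precise sum.
 */
static int64_t model_worst_case_error(void)
{
	struct model_counter c = { 0 };

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_add(&c, cpu, MODEL_BATCH - 1);
	return model_sum_positive(&c) - model_read_positive(&c);
}
```

In this model the cheap read can be off by up to (MODEL_BATCH - 1) per CPU
that the task has touched, whereas the precise sum costs one pass over all
CPUs, which is the trade-off that looks acceptable for a slow path such as
oom_badness().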
