On Wed 14-01-26 09:59:14, Mathieu Desnoyers wrote:
> Use hierarchical per-cpu counters for RSS tracking to improve the
> accuracy of per-mm RSS sum approximation on large many-core systems [1].
> This improves the accuracy of the RSS values returned by proc
> interfaces.
>
> This is also a preparation step to introduce a 2-pass OOM killer task
> selection which leverages the approximation and accuracy ranges to
> quickly eliminate tasks which are outside of the range of the current
> selection, and thus reduce the latency introduced by execution of the
> OOM killer.
>
> Here is a (possibly incomplete) list of the prior approaches that were
> used or proposed, along with their downsides:
>
> 1) Per-thread rss tracking: large error on many-thread processes.
>
> 2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
>    increased system time in make test workloads [1]. Moreover, the
>    inaccuracy increases with O(n^2) with the number of CPUs.
>
> 3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
>    error is high with systems that have lots of NUMA nodes (32 times
>    the number of NUMA nodes).
>
> 4) Use a precise per-cpu counter sum for each counter value query:
>    Requires iteration on each possible CPU for each sum, which
>    adds overhead (and thus increases OOM killer latency) on large
>    many-core systems running many processes.
>
> The approach proposed here is to replace the per-cpu counters by the
> hierarchical per-cpu counters, which bounds the inaccuracy based on the
> system topology with O(N*logN).
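
To make the 2-pass selection idea quoted above concrete, here is a minimal user-space sketch. It is not taken from the patch: struct model_task, model_select_victim() and the single max_err bound are hypothetical, and plain RSS stands in for the real OOM badness score. Pass 1 only reads the cheap approximation and records the best guaranteed lower bound; pass 2 pays for a precise sum only for tasks whose upper bound can still reach that value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a task: both fields are in pages. */
struct model_task {
	long approx;	/* cheap approximate RSS */
	long precise;	/* stands in for the expensive precise sum */
};

/*
 * Return the index of the task with the largest precise value.
 * max_err is the assumed bound on |approx - precise| for every task.
 * *nr_precise reports how many precise sums pass 2 had to compute.
 */
static size_t model_select_victim(const struct model_task *tasks, size_t n,
				  long max_err, size_t *nr_precise)
{
	long best_low, best_precise = 0;
	size_t victim = 0;
	int found = 0;

	assert(n > 0);
	*nr_precise = 0;

	/* Pass 1: largest guaranteed lower bound, approximations only. */
	best_low = tasks[0].approx - max_err;
	for (size_t i = 1; i < n; i++)
		if (tasks[i].approx - max_err > best_low)
			best_low = tasks[i].approx - max_err;

	/* Pass 2: precise sum only where the ranges still overlap. */
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].approx + max_err < best_low)
			continue;	/* cannot be the largest, skip it */
		(*nr_precise)++;
		if (!found || tasks[i].precise > best_precise) {
			best_precise = tasks[i].precise;
			victim = i;
			found = 1;
		}
	}
	return victim;
}
```

The tighter max_err is, the fewer tasks survive pass 1, which is where the accuracy bound of the hierarchical counters would pay off under these assumptions.
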
>
> * Testing results:
>
> Test hardware: 2 sockets AMD EPYC 9654 96-Core Processor (384 logical CPUs
> total)
>
> Methodology:
>
> Comparing the current upstream implementation with the hierarchical
> counters is done by keeping both implementations wired up in parallel,
> and running a single-process, single-threaded program which hops
> randomly across CPUs in the system, calling mmap(2) and munmap(2) on
> random CPUs, keeping track of an array of allocated mappings, randomly
> choosing entries to either map or unmap.
>
> get_mm_counter() is instrumented to compare the upstream counter
> approximation to the precise value, and print the delta when going over
> a given threshold. The delta of the hierarchical counter approximation
> to the precise value is also printed for comparison.
>
> After a few minutes running this test, the upstream implementation
> counter approximation reaches a 1GB delta from the
> precise value, compared to 80MB delta with the hierarchical counter.
> The hierarchical counter provides a guaranteed maximum approximation
> inaccuracy of 192MB on that hardware topology.
>
> * Fast path implementation comparison
>
> The new inline percpu_counter_tree_add() uses a this_cpu_add_return()
> for the fast path (under a certain allocation size threshold). Above
> that, it calls a slow path which "trickles up" the carry to upper level
> counters with atomic_add_return.
>
> In comparison, the upstream counters implementation calls
> percpu_counter_add_batch which uses this_cpu_try_cmpxchg() on the fast
> path, and does a raw_spin_lock_irqsave above a certain threshold.
>
> The hierarchical implementation is therefore expected to have less
> contention on mid-sized allocations than the upstream counters because
> the atomic counters tracking those bits are only shared across nearby
> CPUs. In comparison, the upstream counters immediately use a global
> spinlock when reaching the threshold.
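
The fast path / slow path split described above can be modelled roughly as follows. This is a simplified user-space sketch, not the percpu_counter_tree code from the patch: every model_* name, the two-level layout and the batch sizes are assumptions made for illustration, and the whole value is carried up rather than only the bits above the batch threshold. A plain array slot stands in for this_cpu_add_return(), and C11 atomic_fetch_add() stands in for atomic_add_return().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* All names and sizes below are hypothetical; they model the idea only. */
#define MODEL_NR_CPUS		8
#define MODEL_CPUS_PER_GRP	4
#define MODEL_NR_GROUPS		(MODEL_NR_CPUS / MODEL_CPUS_PER_GRP)
#define MODEL_CPU_BATCH		32	/* carry threshold, per-CPU level */
#define MODEL_GROUP_BATCH	128	/* carry threshold, group level */

struct model_counter_tree {
	long cpu_local[MODEL_NR_CPUS];		/* level 0: one slot per CPU */
	atomic_long group[MODEL_NR_GROUPS];	/* level 1: shared by nearby CPUs */
	atomic_long approx;			/* root: approximate sum */
};

static void model_tree_init(struct model_counter_tree *t)
{
	for (int i = 0; i < MODEL_NR_CPUS; i++)
		t->cpu_local[i] = 0;
	for (int i = 0; i < MODEL_NR_GROUPS; i++)
		atomic_init(&t->group[i], 0);
	atomic_init(&t->approx, 0);
}

static void model_tree_add(struct model_counter_tree *t, int cpu, long inc)
{
	int grp = cpu / MODEL_CPUS_PER_GRP;
	long v, g;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	/* Fast path: CPU-local add, nothing shared is touched. */
	t->cpu_local[cpu] += inc;
	v = t->cpu_local[cpu];
	if (labs(v) < MODEL_CPU_BATCH)
		return;

	/* Slow path: carry the local value to the nearby-CPUs counter. */
	t->cpu_local[cpu] = 0;
	g = atomic_fetch_add(&t->group[grp], v) + v;
	if (labs(g) < MODEL_GROUP_BATCH)
		return;

	/* Carry again, from the group counter up to the root. */
	atomic_fetch_sub(&t->group[grp], g);
	atomic_fetch_add(&t->approx, g);
}

/* Cheap read: root only. */
static long model_tree_approx(struct model_counter_tree *t)
{
	return atomic_load(&t->approx);
}

/* Expensive read: walk every level. */
static long model_tree_precise(struct model_counter_tree *t)
{
	long sum = atomic_load(&t->approx);

	for (int i = 0; i < MODEL_NR_GROUPS; i++)
		sum += atomic_load(&t->group[i]);
	for (int i = 0; i < MODEL_NR_CPUS; i++)
		sum += t->cpu_local[i];
	return sum;
}

/* Worst-case |precise - approx| when driven from a single thread. */
static long model_tree_max_error(void)
{
	return MODEL_NR_CPUS * (MODEL_CPU_BATCH - 1) +
	       MODEL_NR_GROUPS * (MODEL_GROUP_BATCH - 1);
}
```

In this model only the group counter is contended on mid-sized updates, and it is shared by MODEL_CPUS_PER_GRP CPUs rather than by the whole machine, which is the property the quoted text relies on.
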
>
> * Benchmarks
>
> Using will-it-scale page_fault1 benchmarks to compare the upstream
> counters to the hierarchical counters. This is done with hyperthreading
> disabled. The speedup is within the standard deviation of the upstream
> runs, so the overhead is not significant.
>
>                                           upstream   hierarchical    speedup
> page_fault1_processes -s 100 -t 1           614783         615558      +0.1%
> page_fault1_threads -s 100 -t 1             612788         612447      -0.1%
> page_fault1_processes -s 100 -t 96        37994977       37932035      -0.2%
> page_fault1_threads -s 100 -t 96           2484130        2504860      +0.8%
> page_fault1_processes -s 100 -t 192       71262917       71118830      -0.2%
> page_fault1_threads -s 100 -t 192          2446437        2469296      +0.9%
>
> This change depends on the following patch:
> "mm: Fix OOM killer inaccuracy on large many-core systems" [2]

As mentioned in the previous patch, it would be great to explicitly
state the memory cost of the new tracking data structure.

Other than that, this seems like a generally useful improvement for
larger systems, and it is my understanding that it adds almost no
overhead on low-end systems, correct?
-- 
Michal Hocko
SUSE Labs
