On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <[email protected]> wrote:
> From: Hui Zhu <[email protected]>
>
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>
> The series enables two complementary use cases.
>
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically. The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue.
> This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur. The selftest for this feature (patch 10/11) validates the
> mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> with the async reclaim program attached, the "max" counter does not
> increase at all across the same workload.
>

Hi Hui,

Thanks for the series.

Would it not be simpler to just have another memcg knob, something like
memory.high_async? When memory usage > memory.high_async, queue a
per-memcg work item that calls try_to_free_mem_cgroup_pages() until usage
drops back below some threshold. I am not sure I see what programmability
aspect from BPF you need here.

Thanks

>
> 08/11 selftests/bpf: Add tests for memcg_bpf_ops
>     Adds prog_tests/memcg_ops.c covering three scenarios:
>     memcg_charged-only throttling, below_low + memcg_charged
>     interaction, and below_min + memcg_charged interaction. A
>     tracepoint on memcg:count_memcg_events (PGFAULT) is used to
>     detect memory pressure and trigger hooks accordingly.
>
> 09/11 selftests/bpf: Add test for memcg_bpf_ops hierarchies
>     Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
>     three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
>     root, override at the middle level without the flag, then assert
>     that attaching to the leaf correctly fails with -EBUSY.
>
> 10/11 selftests/bpf: Add selftest for memcg async reclaim via BPF
>     Demonstrates and validates asynchronous memory reclaim: a BPF
>     program uses the memcg_charged/memcg_uncharged hooks to track
>     accumulated usage and, when a threshold is exceeded, enqueues a
>     bpf_wq_start() workqueue item that calls
>     bpf_try_to_free_mem_cgroup_pages() without blocking the charge
>     path.
>     The test asserts that with the BPF program active,
>     memory.events "max" events do not increase under a workload
>     that would otherwise exceed the hard limit.
>
> 11/11 samples/bpf: Add memcg priority control and async reclaim example
>     Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
>     demonstrating both features. The BPF side monitors PGFAULT
>     events on a high-priority cgroup; when the per-second fault
>     count crosses a configurable threshold, it activates below_low
>     or below_min protection for the high-priority cgroup and/or
>     applies a charge delay to the low-priority cgroup. Six
>     struct_ops variants are exported so userspace can attach only
>     the hooks needed. Async reclaim is optionally combined with
>     priority throttling via a shared low-cgroup ops map.
>
> Test Environment:
> The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
> a tmpfs-backed file on the host as a swap device to reduce I/O impact.
> Two cgroups are created -- high (high-priority) and low (low-priority)
> -- and each test runs two concurrent stress-ng workloads, one per
> cgroup, each requesting 3 GB of memory.
>
> # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
> # free -h
>                total        used        free      shared  buff/cache   available
> Mem:           1.9Gi       317Mi       1.6Gi       1.0Mi       144Mi       1.6Gi
> Swap:          4.0Gi          0B       4.0Gi
>
> Baseline: no memory priority policy:
> Both cgroups run without any reclaim protection. Results are roughly
> equal, as expected:
>
> cgroup    bogo ops/s
> high           4,979
> low            4,927
>
> Test 1: memory.low protection:
> Setting memory.low on the high-priority cgroup protects it from
> reclaim, at the cost of pushing reclaim pressure onto the low-priority
> cgroup:
>
> # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
>
> cgroup    bogo ops/s
> high         450,290
> low           11,307
>
> The high-priority cgroup benefits significantly, but memory.low relies
> on static usage thresholds and cannot adapt to actual workload
> behavior.
>
> Test 2: memory.low with an idle high-priority task:
> Here the high-priority cgroup runs a Python script that allocates 3 GB
> and then sleeps, simulating a low-activity but memory-holding workload.
> Because the process is idle, it generates no page faults and does not
> actively use its memory. Yet memory.low still protects it, continuing
> to suppress the low-priority cgroup's performance:
>
> cgroup    bogo ops/s
> low           14,757
>
> The low-priority cgroup remains significantly throttled despite the
> high-priority cgroup being effectively idle -- a clear limitation of
> static memory.low control.
>
> Test 3: memcg eBPF -- dynamic priority control:
> memcg is a sample program introduced in this patch series
> (samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
> monitors PGFAULT events in the high-priority cgroup. When the
> per-second fault count exceeds a configured threshold, the hook
> activates below_min protection for one second; otherwise the cgroup
> receives no special treatment.
>
> # ./memcg --low_path=/sys/fs/cgroup/low \
>     --high_path=/sys/fs/cgroup/high \
>     --threshold=1 --use_below_min
> Successfully attached!
>
> 3a. Both cgroups under active memory pressure:
>
> When both cgroups run stress-ng, the high-priority cgroup generates
> frequent page faults and the BPF hook activates protection, matching
> the behavior of memory.low:
>
> cgroup    bogo ops/s
> high         404,392
> low           11,404
>
> 3b. High-priority cgroup is idle (Python + sleep):
>
> Because the sleeping Python process generates no page faults, the BPF
> hook never activates, and the low-priority cgroup is free to reclaim
> memory normally:
>
> cgroup    bogo ops/s
> low          551,083
>
> This is a ~37x improvement over the equivalent memory.low scenario
> (Test 2), demonstrating that eBPF-driven dynamic control can
> accurately reflect actual workload activity and avoid unnecessary
> protection of idle high-priority tasks.
>
> Summary:
> Scenario                   low-cgroup bogo ops/s
> Baseline (no policy)                      ~4,927
> memory.low, both active                  ~11,307
> memory.low, high idle                    ~14,757
> memcg eBPF, both active                  ~11,404
> memcg eBPF, high idle                   ~551,083
>
> References:
> [1]
> https://patchew.org/linux/[email protected]/
>
> Changelog:
> v7:
>   Change base commits of "mm: BPF OOM" to v3.
>   Some fixes according to the comments of bpf-ci.
>   Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
>   hook for tracking uncharge events.
>   Update below_low and below_min hooks to receive elow/emin and usage
>   as explicit arguments.
>   Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
>   to BPF programs.
>   Add selftest for BPF-driven asynchronous page reclaim.
>   Extend samples/bpf/memcg to support async reclaim in addition to
>   priority throttling.
> v6:
>   Based on the bot+bof-ci comments, fixed the following issues.
>   Added fast-path check with unlikely() before SRCU lock acquisition to
>   optimize the no-BPF case in BPF_MEMCG_CALL.
>   Add missing newline in pr_warn message to bpf_memcontrol_init.
>   Added comprehensive child process exit status checking with WIFEXITED()
>   and WEXITSTATUS(), and added zombie process prevention in
>   real_test_memcg_ops.
>   Changed malloc() to calloc() for BSS data allocation in all test
>   functions and samples main function.
>   Change srcu_read_lock(&memcg_bpf_srcu) to
>   lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
>   and memcontrol_bpf_offline.
> v5:
>   Based on the bot+bof-ci comments, fixed the following issues.
>   Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
>   declaration to the beginning of need_threshold() function.
>   The 'u64 current_ts' variable must be declared before any
>   executable statements
>   Improved input validation in samples/bpf/memcg.c by adding a new
>   parse_u64() helper function.
>   This function properly handles errors
>   from strtoull() and provides better error messages when parsing
>   threshold and over_high_ms command-line arguments.
>   Move check for prog->sleepable after validating member offsets in
>   mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
>   Fixed sscanf return value checking in prog_tests/memcg_ops.c.
>   Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
>   sscanf returns the number of successfully matched items, not a negative
>   value on error. This makes the test more reliable when reading timing
>   data from temporary files.
> v4:
>   Fix the issues according to the comments from bot+bof-ci.
>   According to JP Kobryn's comments, move exit(0) from
>   real_test_memcg_ops_child_work to real_test_memcg_ops.
>   Fix issues in the bpf_memcg_ops_reg function.
> v3:
>   According to the comments from Michal Koutný and Chen Ridong, update hooks
>   to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
>   handle_cgroup_offline.
>   According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
>   support to memcg_bpf_ops.
> v2:
>   According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
>   OOM patch series [1] and added hierarchical delegation support.
>   According to the comments from Roman Gushchin and Michal Hocko, designed
>   concrete use case scenarios and provided test results.
>
> Hui Zhu (7):
>   bpf: Pass flags in bpf_link_create for struct_ops
>   mm: memcontrol: Add BPF struct_ops for memory controller
>   mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
>   selftests/bpf: Add tests for memcg_bpf_ops
>   selftests/bpf: Add test for memcg_bpf_ops hierarchies
>   selftests/bpf: Add selftest for memcg async reclaim via BPF
>   samples/bpf: Add memcg priority control and async reclaim example
>
> Roman Gushchin (4):
>   bpf: move bpf_struct_ops_link into bpf.h
>   bpf: allow attaching struct_ops to cgroups
>   libbpf: fix return value on memory allocation failure
>   libbpf: introduce bpf_map__attach_struct_ops_opts()
>
>  MAINTAINERS                                   |   6 +
>  include/linux/bpf-cgroup-defs.h               |   3 +
>  include/linux/bpf-cgroup.h                    |  16 +
>  include/linux/bpf.h                           |  10 +
>  include/linux/memcontrol.h                    | 250 ++++++-
>  include/uapi/linux/bpf.h                      |   5 +-
>  kernel/bpf/bpf_struct_ops.c                   |  67 +-
>  kernel/bpf/cgroup.c                           |  46 ++
>  mm/bpf_memcontrol.c                           | 355 +++++++++-
>  mm/memcontrol.c                               |  43 +-
>  samples/bpf/.gitignore                        |   1 +
>  samples/bpf/Makefile                          |   8 +-
>  samples/bpf/memcg.bpf.c                       | 380 +++++++++++
>  samples/bpf/memcg.c                           | 411 ++++++++++++
>  tools/include/uapi/linux/bpf.h                |   3 +-
>  tools/lib/bpf/libbpf.c                        |  22 +-
>  tools/lib/bpf/libbpf.h                        |  14 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
>  .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
>  .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
>  .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
>  tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
>  24 files changed, 2952 insertions(+), 34 deletions(-)
>  create mode 100644 samples/bpf/memcg.bpf.c
>  create mode 100644 samples/bpf/memcg.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
>
> --
> 2.43.0
>
>
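
For reference, the threshold-triggered background reclaim pattern discussed
in this thread can be modelled in a few lines of plain userspace C. This is
only a rough sketch under stated assumptions: every name here (memcg_model,
model_charge, model_reclaim_work, model_run), the 4 KB page size, the
6000-page threshold and the 512-page reclaim batch are hypothetical and
chosen for illustration. None of it is kernel or BPF API, and it does not
reproduce the code in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one cgroup; not a kernel structure. */
struct memcg_model {
	size_t usage;        /* pages currently charged */
	size_t threshold;    /* background reclaim starts above this */
	size_t max;          /* hard limit, standing in for memory.max */
	bool work_pending;   /* standing in for a queued work item */
	unsigned max_events; /* number of charges that hit the hard limit */
};

/* Charge path: account the pages, queue work if needed, never reclaim here. */
void model_charge(struct memcg_model *cg, size_t pages)
{
	if (cg->usage + pages > cg->max) {
		/* The real kernel would enter direct reclaim or OOM here. */
		cg->max_events++;
		cg->usage = cg->max;
	} else {
		cg->usage += pages;
	}
	if (cg->usage > cg->threshold)
		cg->work_pending = true;
}

/* Work item: free up to one batch, stay queued while above the threshold. */
void model_reclaim_work(struct memcg_model *cg, size_t batch)
{
	size_t freed = batch < cg->usage ? batch : cg->usage;

	cg->usage -= freed;
	cg->work_pending = cg->usage > cg->threshold;
}

/*
 * Charge `step` pages `rounds` times against an 8192-page limit
 * (32 MB at an assumed 4 KB page size). With `async` set, pending
 * work is drained after every charge. Returns the limit-hit count.
 */
unsigned model_run(bool async, size_t rounds, size_t step)
{
	struct memcg_model cg = { .usage = 0, .threshold = 6000, .max = 8192 };

	for (size_t i = 0; i < rounds; i++) {
		model_charge(&cg, step);
		while (async && cg.work_pending)
			model_reclaim_work(&cg, 512);
	}
	return cg.max_events;
}
```

Calling model_run(false, 64, 256) charges 64 MB worth of pages with no
background work and counts limit hits, while model_run(true, 64, 256)
drains the work item after each charge and never reaches the limit. The
model is agnostic about what sets the threshold, so it applies equally to
a fixed knob and to a BPF-supplied policy.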

