Hi Michal,

> On Tue 26-05-26 10:20:00, Hui Zhu wrote:
> > From: Hui Zhu <[email protected]>
> >
> > Overview:
> > This series introduces BPF struct_ops support for the memory controller,
> > enabling userspace BPF programs to implement custom, dynamic memory
> > management policies per cgroup. The feature allows BPF programs to hook
> > into the core reclaim and charge paths without requiring kernel
> > modifications, providing a flexible alternative to static knobs such as
> > memory.low and memory.min.
> >
> > The series enables two complementary use cases.
> >
> > Dynamic memory protection: static memory protection thresholds
> > (memory.low, memory.min) are poor fits for workloads whose actual memory
> > activity varies over time. A high-priority cgroup holding a large working
> > set but temporarily idle will still suppress reclaim on its siblings,
> > wasting available memory. A BPF-driven approach can observe real workload
> > activity -- page faults, charge/uncharge events -- and activate or
> > withdraw protection dynamically.
> >
> Why the same cannot be achieved by dynamically changing protection?

Dynamically adjusting memory.low or memory.min is indeed an option, but
it has a practical drawback: in many production environments these
values are managed and pushed down by a cluster-level orchestrator (e.g.
a container runtime or resource manager). Modifying them from a separate
BPF-based agent risks conflicts with the orchestrator's own control loop
and makes the system harder to reason about.

Beyond that, the intended use case requires rapid, short-lived
adjustments -- reacting to bursts of page faults or PSI spikes and
reverting just as quickly once the pressure subsides. Mutating the
static knobs for that purpose feels like the wrong abstraction: the
knobs express policy intent, while what we need is a transient override
that sits on top of that policy.

The hooks are therefore not meant to replace the existing limits, but to
complement them: the orchestrator continues to own memory.low /
memory.min, while a BPF program makes small, brief corrections in
response to observed runtime behavior.

> >
> > The test results at the end of this
> > letter quantify the difference: in a scenario where the high-priority
> > cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> > 37x higher throughput than with static memory.low.
> >
> > Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> > hooks, combined with the BPF workqueue mechanism and the new
> > bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> > proactive background reclaim without blocking the charge path. The
> > pattern works as follows: the memcg_charged callback tracks accumulated
> > memory usage; when usage crosses a configurable threshold, it enqueues an
> > asynchronous work item via bpf_wq_start() and returns immediately without
> > throttling the charging task. The workqueue callback then invokes
> > bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> > cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> > itself to continue. This allows a BPF program to keep a cgroup's
> > footprint below its hard limit (memory.max) entirely in the background,
> > avoiding the OOM killer or direct-reclaim stalls that would otherwise
> > occur.
> >
> How do you account the overall work done to the specific memcg as the
> large part of the reclaim is done from WQ context?

One approach to attribute the reclaim work accurately to the target
memcg would be to expose a kfunc that creates a kthread_worker and
attaches it to a specific cgroup. Reclaim work enqueued to that worker
would then run in a context already associated with the target memcg, so
the accounting would naturally fall to the right cgroup without any
extra bookkeeping.

The tradeoff is additional complexity: creating a per-cgroup worker
introduces resource overhead and lifecycle management concerns (e.g.
when should the worker be torn down). Whether that cost is justified
depends on how strictly the caller needs the reclaim to be attributed.

That said, I am not certain this is the right direction yet and would
welcome your thoughts on whether this is worth pursuing, or whether
there is a simpler mechanism I am overlooking.

> Also when introducing a BPF hook please focus on describing why existing
> interfaces fail to achieve what you need. For the async reclaim why it
> is not practical or feasible to use userspace driven memory reclaim.

Noted, and thank you for both points. In the next revision I will add a
dedicated section to each hook's description covering:

- Why existing interfaces are insufficient. For the async reclaim case
  specifically, I will explain why userspace-driven reclaim (e.g.
  memory.reclaim, cgroup-aware madvise, or a dedicated reclaim daemon)
  is not practical: userspace cannot react at the granularity or
  latency required, and the round-trip through a syscall or cgroupfs
  write introduces overhead that defeats the purpose of proactive
  reclaim.

- What gap the new hook fills that cannot be closed by tuning existing
  knobs.

Best,
Hui

> --
> Michal Hocko
> SUSE Labs
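
For reference, the control flow of the asynchronous reclaim pattern
quoted above (charge, threshold check, enqueue, reclaim, re-enqueue) can
be modeled in plain userspace C. This is only a sketch under stated
assumptions, not the BPF program from the series: struct memcg_model and
the model_*() functions are hypothetical names, a boolean flag stands in
for a work item queued with bpf_wq_start(), and a simple subtraction
stands in for bpf_try_to_free_mem_cgroup_pages(). The work is drained
synchronously here, whereas the real workqueue runs it in the
background.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct memcg_model {
	long usage;         /* pages currently charged */
	long threshold;     /* background reclaim starts above this */
	long reclaim_batch; /* pages one work item tries to free */
	bool work_pending;  /* stands in for a queued bpf_wq work item */
	long work_runs;     /* number of work items executed so far */
};

/* Models the memcg_charged callback: account, maybe enqueue, never block. */
void model_charged(struct memcg_model *m, long pages)
{
	m->usage += pages;
	if (m->usage > m->threshold && !m->work_pending)
		m->work_pending = true; /* stands in for bpf_wq_start() */
}

/* Models the workqueue callback: free one batch, re-enqueue if still high. */
void model_work(struct memcg_model *m)
{
	long freed;

	assert(m->reclaim_batch > 0);
	freed = m->usage < m->reclaim_batch ? m->usage : m->reclaim_batch;

	m->work_pending = false;
	m->usage -= freed; /* stands in for bpf_try_to_free_mem_cgroup_pages() */
	m->work_runs++;

	/* Re-enqueue only if progress was made and usage is still elevated. */
	if (freed > 0 && m->usage > m->threshold)
		m->work_pending = true;
}

/* Drains pending work; the real workqueue would do this asynchronously. */
void model_run_pending(struct memcg_model *m)
{
	while (m->work_pending)
		model_work(m);
}
```

With a threshold of 1000 pages and a batch of 64, charging 900 pages
queues nothing; charging 300 more queues one work item, which re-enqueues
itself until usage drops to 944 after four runs. Note that the charging
side never waits in this model, and that all the reclaim effort is spent
in model_work(), which is exactly the part whose accounting is the open
question above.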

