On Mon, Mar 2, 2026 at 7:51 AM Christian König <[email protected]> wrote:
>
> On 3/2/26 16:40, Shakeel Butt wrote:
> > +TJ
> >
> > On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
> >> On 3/2/26 15:15, Shakeel Butt wrote:
> >>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
> >>>> On 2/24/26 20:28, Dave Airlie wrote:
> >>> [...]
> >>>>
> >>>>> This has been a pain in the ass for desktop for years, and I'd like to
> >>>>> fix it; the HPC use case is purely a driver for me doing the work.
> >>>>
> >>>> Wait a second. How does accounting to cgroups help with that in any way?
> >>>>
> >>>> The last time I looked into this problem the OOM killer worked based on
> >>>> the per-task_struct stats, which couldn't be influenced this way.
> >>>>
> >>>
> >>> It depends on the context of the oom-killer. If the oom-killer is
> >>> triggered due to memcg limits, then only the processes in the scope of
> >>> that memcg will be targeted by the oom-killer. With the specific setting,
> >>> the oom-killer can kill all the processes in the target memcg.
> >>>
> >>> However, nowadays a userspace oom-killer is preferred over the kernel
> >>> oom-killer due to flexibility and configurability. Userspace oom-killers
> >>> like systemd-oomd, Android's LMKD or fb-oomd are being used in
> >>> containerized environments. Such oom-killers look at memcg stats, so
> >>> hiding something from memcg, i.e. not charging it to memcg, will hide
> >>> that usage from these oom-killers.
> >>
> >> Well, exactly that's the problem. Android's oom killer is *not* using
> >> memcg, exactly because of this inflexibility.
> >
> > Are you sure Android's oom killer is not using memcg? From what I see in
> > the documentation [1], it requires memcg.
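(The "specific setting" mentioned above is presumably cgroup v2's memory.oom.group knob. A minimal sketch of configuring a memcg so a memcg OOM kills the whole group, assuming a v2 hierarchy mounted at /sys/fs/cgroup and root privileges; the "myapp" cgroup name is just an example:)

```shell
# Create a memcg and cap its memory usage at 512 MiB.
mkdir /sys/fs/cgroup/myapp
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/myapp/memory.max

# With memory.oom.group=1, a memcg OOM kills all tasks in the
# cgroup together instead of picking a single victim.
echo 1 > /sys/fs/cgroup/myapp/memory.oom.group

# Move the current shell into the cgroup.
echo $$ > /sys/fs/cgroup/myapp/cgroup.procs
```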
LMKD used to use memcg v1 for memory.pressure_level, but that has been
replaced by PSI, which is now the default configuration. I deprecated all
configurations with memcg v1 dependencies in January. We plan to remove the
memcg v1 support from LMKD when the 5.10 and 5.15 kernels reach EOL.

> My bad, I should have worded that better.
>
> The Android OOM killer is not using memcg for tracking GPU memory
> allocations, because memcg doesn't have proper support for tracking shared
> buffers.
>
> In other words, GPU memory allocations are shared by design, and it is the
> norm that the process which is using a buffer is not the process which
> allocated it.
>
> What we would need (as a start) to handle all of this with memcg would be
> to account the resources to the process which referenced them and not the
> one which allocated them.
>
> I can give a full list of requirements which would be needed by cgroups to
> cover all the different use cases, but it basically means tons of extra
> complexity.

Yeah, this is right. We usually prioritize fast kills rather than picking the
biggest offender, though. Application state (foreground / background) is the
primary selector; however, LMKD does have a mode (kill_heaviest_task) where
it will pick the largest task within a group of apps sharing the same
application state. For this it uses RSS from /proc/<pid>/statm, and (prepare
to avert your eyes) a new and out-of-tree interface in procfs for accounting
dmabufs used by a process. It tracks FD references and map references as they
come and go, and counts any buffer only once per process, regardless of the
number and type of references the process has to the same buffer. I dislike
it greatly. My original intention was to use the dmabuf BPF iterator we added
to scan the maps and FDs of a process for dmabufs on demand. Very simple and
pretty fast in BPF. This wouldn't support high-watermark tracking, so I was
forced into doing something else for per-process accounting.
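(For context, the statm-based RSS read mentioned above can be sketched as follows. This is a minimal illustration, not LMKD's actual code, which is written in C; `rss_bytes` is a hypothetical helper. statm reports values in pages, and the second field is the resident set size:)

```python
import os

def rss_bytes(pid="self"):
    """Return the resident set size of a process in bytes.

    Reads /proc/<pid>/statm, whose fields are counted in pages;
    field 2 (index 1) is RSS.
    """
    with open(f"/proc/{pid}/statm") as f:
        rss_pages = int(f.read().split()[1])
    return rss_pages * os.sysconf("SC_PAGE_SIZE")

print(rss_bytes())
```

A selector like kill_heaviest_task would compute this for each candidate pid in the same application state and pick the maximum.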
To be fair, the HWM tracking has detected a few application bugs where 4GB of
system memory was inadvertently consumed by dmabufs.

The BPF iterator is currently used to support accounting of buffers not
visible in userspace (dmabuf_dump / libdmabufinfo), and it's a nice
improvement for that over the old sysfs interface. I hope to replace the slow
scanning of procfs for dmabufs in libdmabufinfo with BPF programs that use
the dmabuf iterator, but that's not a priority for this year.

Independent of all of that, memcg doesn't really work well for this because
the memory is shared yet can only be attributed to a single memcg, and the
most common allocator (gralloc) lives in a separate process and memcg from
the processes using the buffers (camera, YouTube, etc.). I had a few patches
that transferred the ownership of buffers to a new memcg when they were sent
via Binder, but this used the memcg v1 charge-moving functionality, which is
now gone because it was so complicated. And that only works if there is a
single user that should be charged for the buffer anyway. What if it is
shared by multiple applications and services?

> Regards,
> Christian.
>
> >
> > [1] https://source.android.com/docs/core/perf/lmkd
> >
> >>
> >> See the multiple iterations we already had on that topic. Even including
> >> reverting already-upstream uAPI.
> >>
> >> The latest incarnation is that BPF is used for this task on Android.
> >>
> >> Regards,
> >> Christian.
>
