https://bugzilla.kernel.org/show_bug.cgi?id=60533
Artem S. Tashkinov ([email protected]) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |WILL_NOT_FIX

--- Comment #44 from Artem S. Tashkinov ([email protected]) ---
Further testing narrows this down substantially.

With Chrome started as:

```
google-chrome-stable --disable-gpu --disable-gpu-compositing
```

the reproducer crashes only the offending tab. It does not trigger a system-wide OOM and does not kill unrelated browser processes.

With normal GPU acceleration enabled, the same workload rapidly causes a global OOM, kills unrelated Chrome/Firefox processes, and leaves memory exhaustion that is not explained by ordinary Chrome process RSS.

This strongly suggests that the catastrophic failure depends on the accelerated graphics path:

```
Chrome → ANGLE/Mesa → DRM/GEM → xe
```

I do not yet claim that xe alone is necessarily at fault; this could involve Chrome, ANGLE/Mesa, shared DRM/GEM code, or xe object lifetime/accounting/reclaim. However, this report appears to belong on the DRM xe GitLab tracker rather than kernel.org Bugzilla.

I have opened the corresponding report here:

https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/8393

The issue must affect i915 similarly but I'm not sure about amdgpu.

(In reply to Eero Tamminen from comment #43)
> There's now DMEM cgroup for limiting device (e.g. GPU) memory:
> https://docs.kernel.org/admin-guide/cgroup-v2.html#dmem
>
> But I guess this bug is only about iGPUs i.e. when GPU memory is shared with
> the CPU?

`dmem` is clearly relevant, but I do not think it makes this purely an iGPU question.

My reading is that `dmem` limits allocations in registered device-memory regions. That is an obvious fit for dGPU VRAM, and perhaps for explicitly exposed iGPU regions such as stolen memory.
It is less clear whether it covers the host-RAM-backed GEM/TTM allocations that are often involved in UMA paths, or dGPU objects that have been evicted/migrated to system RAM.

So there seem to be two related but distinct failure modes:

1. Device-local-memory exhaustion: a dGPU process consumes VRAM. `dmem.max` looks like the intended containment mechanism.
2. Host-RAM exhaustion induced by GPU objects: buffers, imported dma-bufs, staging allocations, page tables, shmem-backed GEM objects, or migrated BOs consume ordinary RAM. This is especially easy to turn into a global OOM on an iGPU/UMA system because “GPU memory” is ultimately the same DRAM pool as everything else.

The second manifestation may indeed be mostly visible on iGPUs, but I would not call the underlying problem iGPU-exclusive without checking where the relevant allocations are charged. A dGPU can still consume host RAM through system-memory placements and migration paths.

The useful question for this report might be: does the allocation path at issue appear in `dmem.current` for the offending cgroup, or is it only charged through the normal memory cgroup / not attributed usefully at all? If `dmem.current` stays flat while host RAM is exhausted, then `dmem` does not address that particular path.
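
The `dmem.current` versus memory-cgroup comparison could be checked with a small helper along these lines. This is only a sketch under assumptions: cgroup v2 is mounted, the browser runs in its own cgroup, and the `dmem` controller is enabled for it. The function name `sample_cgroup_mem` and the cgroup path passed to it are hypothetical; the only interface files read are `memory.current` and `dmem.current`.

```shell
#!/bin/sh
# Sketch: print the host-RAM and device-memory charges of one cgroup v2
# directory. sample_cgroup_mem is a hypothetical helper name; the caller
# supplies the cgroup directory that contains the browser processes.
sample_cgroup_mem() {
    cg=$1

    # memory.current holds a single integer: bytes charged to the
    # memory controller for this cgroup.
    if [ -r "$cg/memory.current" ]; then
        printf 'memory.current %s\n' "$(cat "$cg/memory.current")"
    else
        printf 'memory.current unavailable\n'
    fi

    # dmem.current holds one "<region> <bytes>" line per registered
    # device-memory region. It is absent if the controller is not enabled.
    if [ -r "$cg/dmem.current" ]; then
        while read -r region bytes || [ -n "$region" ]; do
            if [ -n "$region" ]; then
                printf 'dmem.current %s %s\n' "$region" "$bytes"
            fi
        done < "$cg/dmem.current"
    else
        printf 'dmem.current unavailable\n'
    fi
}
```

Sampling once per second while the reproducer runs, for example `while sleep 1; do sample_cgroup_mem /sys/fs/cgroup/<browser-cgroup>; done` with the placeholder replaced by the real cgroup directory, would show whether the `dmem.current` lines grow along with `memory.current` or stay flat while host RAM is consumed.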
