https://bugzilla.kernel.org/show_bug.cgi?id=60533

Artem S. Tashkinov ([email protected]) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |WILL_NOT_FIX

--- Comment #44 from Artem S. Tashkinov ([email protected]) ---
Further testing narrows this down substantially.

With Chrome started as:

```
google-chrome-stable --disable-gpu --disable-gpu-compositing
```

the reproducer crashes only the offending tab. It does not trigger a
system-wide OOM and does not kill unrelated browser processes.

With normal GPU acceleration enabled, the same workload rapidly causes a global
OOM, kills unrelated Chrome/Firefox processes, and exhausts memory to an extent
that ordinary Chrome process RSS does not explain.
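
A rough way to quantify that gap is sketched below. It is only an
approximation under stated assumptions: "used" memory is taken as `MemTotal`
minus `MemAvailable` from `/proc/meminfo`, and process memory as the sum of
`VmRSS` over all of `/proc/<pid>/status`. Summed RSS double-counts shared
pages, and kernel allocations (slab, page tables) also land in the remainder,
so only a large and growing remainder is meaningful. The function names are
hypothetical helpers, not part of any existing tool.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are in kB; hugepage counters are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def vm_rss_kib(status_text):
    """Return VmRSS (kB) from /proc/<pid>/status text, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def total_rss_kib(proc="/proc"):
    """Sum VmRSS over every numeric entry under /proc (over-counts shared pages)."""
    total = 0
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                total += vm_rss_kib(f.read())
        except OSError:
            continue  # the process exited while we were scanning
    return total


def unexplained_kib(meminfo, rss_kib):
    """Memory in use (MemTotal - MemAvailable) not covered by summed RSS."""
    used = meminfo["MemTotal"] - meminfo["MemAvailable"]
    return used - rss_kib


def report():
    """Print one sample; run repeatedly while the reproducer is active."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    rss = total_rss_kib()
    print("used kB:", info["MemTotal"] - info["MemAvailable"])
    print("sum of VmRSS kB:", rss)
    print("not explained by RSS kB:", unexplained_kib(info, rss))
```

Calling `report()` once before starting the reproducer and again shortly
before the OOM gives two samples to compare. If the last figure grows by
gigabytes while the summed RSS barely moves, the memory is being held
somewhere that per-process RSS does not show.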

This strongly suggests that the catastrophic failure depends on the accelerated
graphics path:

```
Chrome → ANGLE/Mesa → DRM/GEM → xe
```

I am not yet claiming that xe alone is at fault; this could involve Chrome,
ANGLE/Mesa, shared DRM/GEM code, or xe object lifetime/accounting/reclaim.
Either way, this report appears to belong on the DRM xe GitLab tracker rather
than kernel.org Bugzilla.

I have opened the corresponding report here:
https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/8393

I expect i915 to be affected similarly, but I am not sure about amdgpu.

(In reply to Eero Tamminen from comment #43)
> There's now DMEM cgroup for limiting device (e.g. GPU) memory:
> https://docs.kernel.org/admin-guide/cgroup-v2.html#dmem
> 
> But I guess this bug is only about iGPUs i.e. when GPU memory is shared with
> the CPU?

`dmem` is clearly relevant, but I do not think it makes this purely an iGPU
question.

My reading is that `dmem` limits allocations in registered device-memory
regions. That is an obvious fit for dGPU VRAM, and perhaps for explicitly
exposed iGPU regions such as stolen memory. It is less clear whether it covers
the host-RAM-backed GEM/TTM allocations that are often involved in UMA paths,
or dGPU objects that have been evicted/migrated to system RAM.

So there seem to be two related but distinct failure modes:

1. Device-local-memory exhaustion: a dGPU process consumes VRAM. `dmem.max`
looks like the intended containment mechanism.

2. Host-RAM exhaustion induced by GPU objects: buffers, imported dma-bufs,
staging allocations, page tables, shmem-backed GEM objects, or migrated BOs
consume ordinary RAM. This is especially easy to turn into a global OOM on an
iGPU/UMA system because “GPU memory” is ultimately the same DRAM pool as
everything else.

The second manifestation may indeed be mostly visible on iGPUs, but I would not
call the underlying problem iGPU-exclusive without checking where the relevant
allocations are charged. A dGPU can still consume host RAM through
system-memory placements and migration paths.

The useful question for this report might be: does the allocation path at issue
appear in `dmem.current` for the offending cgroup, or is it charged only to the
normal memory cgroup, or not usefully attributed at all? If `dmem.current`
stays flat while host RAM is exhausted, then `dmem` does not address that
particular path.
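
One way to check this is sketched below, under several assumptions: cgroup v2
is mounted, the `dmem` and `memory` controllers are enabled for the cgroup in
question (a non-root cgroup), and `dmem.current` contains one
`<region> <bytes>` pair per line, as I read the cgroup-v2 documentation. The
function names are hypothetical, and the cgroup path has to be supplied by
whoever runs it.

```python
from pathlib import Path


def parse_dmem(text):
    """Parse dmem.current-style text ("<region> <bytes>" per line) into a dict."""
    usage = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            usage[parts[0]] = int(parts[1])
    return usage


def read_usage(cgroup_dir):
    """Take one sample of dmem.current and memory.current for a cgroup directory."""
    base = Path(cgroup_dir)
    dmem_file = base / "dmem.current"
    # dmem.current is absent when the dmem controller is not enabled here.
    dmem = parse_dmem(dmem_file.read_text()) if dmem_file.exists() else {}
    memory = int((base / "memory.current").read_text())
    return {"dmem": dmem, "memory": memory}


def growth(before, after):
    """Per-region dmem growth and memory-cgroup growth between two samples (bytes)."""
    regions = set(before["dmem"]) | set(after["dmem"])
    dmem_delta = {
        region: after["dmem"].get(region, 0) - before["dmem"].get(region, 0)
        for region in regions
    }
    return {"dmem": dmem_delta, "memory": after["memory"] - before["memory"]}
```

Sampling with `read_usage()` before and during the reproducer and passing both
samples to `growth()` separates three cases. If a `dmem` region grows,
`dmem.max` should be able to contain the allocation. If only `memory` grows,
the ordinary memory cgroup is what accounts for it. If neither grows while
system RAM is being exhausted, the allocations are not attributed to the cgroup
at all.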
