Hi,

I've hit a full GPU deadlock on Snapdragon X1E while running the Zed
editor. The msm GEM shrinker self-deadlocks when msm_gem_fault() triggers
direct reclaim while holding the VM's dma_resv.

System info
-----------
Kernel: 7.0.0-32-qcom-x1e #32-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 21 00:33:58 UTC 2026
Hardware: Qualcomm X1E (Snapdragon X Elite), Adreno GPU (msm_dpu)
Distro: Ubuntu (linux-qcom-x1e package)

Trigger
-------
Running the Zed editor, which uses WebAssembly language servers (multiple
/memfd:wasm-memory-image allocations) and interacts with the GPU via
DMA-BUF.

Deadlock sequence
-----------------
1. msm_gem_fault() locks the GEM object (via
   msm_gem_lock_interruptible()). For NO_SHARE objects, obj->resv IS the
   VM's dma_resv.
2. get_pages() -> drm_prime_pages_to_sg() -> sg_kmalloc() -> alloc_pages()
   with GFP_KERNEL (unconditional in drm_prime.c).
3. CMA is nearly exhausted (3.2 MB of 128 MB free), so the page allocator
   enters direct reclaim.
4. Direct reclaim calls msm_gem_shrinker_scan(), which calls evict().
5. evict() -> with_vm_locks() tries dma_resv_lock(vm->resv, ticket) on
   a NON-NO_SHARE object attached to the same VM.
6. vm->resv is already held by the same thread from step 1. Deadlock.

The code comment in with_vm_locks() is relevant:
"Since we already skip the case when the VM and obj share a resv
(ie. _NO_SHARE objs), we don't expect to hit a double-locking
scenario..."

The check skips ww_mutex locking when resv == obj->resv (the NO_SHARE case
for the object being evicted). But the deadlock is cross-object: a NO_SHARE
object caused the fault and locked the VM resv, and the shrinker then tries
to evict a different, non-NO_SHARE object on the same VM. That object's
obj->resv differs from vm->resv, so the skip doesn't trigger. The lock is
already held by the same thread, but not under the shrinker's
ww_acquire_ctx, so ww_mutex can't return -EALREADY and the thread blocks
on itself.
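
To make the cross-object case concrete, here is a small userspace C model
of the skip check. It is a sketch, not kernel code: every model_* name is
hypothetical, locks are reduced to a "held" flag, and the only thing it
captures is that the skip compares the evicted object's resv, not the
resv held by the faulting thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; all names here are hypothetical. */
struct model_resv {
	bool held;			/* true while some path holds the lock */
};

struct model_vm {
	struct model_resv resv;		/* models the VM's dma_resv */
};

struct model_obj {
	struct model_resv own_resv;	/* used only by non-NO_SHARE objects */
	struct model_resv *resv;	/* points at vm->resv for NO_SHARE objects */
	struct model_vm *vm;
};

static void model_obj_init(struct model_obj *obj, struct model_vm *vm,
			   bool no_share)
{
	obj->own_resv.held = false;
	obj->vm = vm;
	obj->resv = no_share ? &vm->resv : &obj->own_resv;
}

/* Models the skip: only the victim's own resv is compared to the VM's. */
static bool model_evict_needs_vm_lock(const struct model_obj *victim)
{
	return victim->resv != &victim->vm->resv;
}

/* True if evicting `victim` would wait on a lock that is already held. */
static bool model_evict_would_block(const struct model_obj *victim)
{
	if (!model_evict_needs_vm_lock(victim))
		return false;		/* NO_SHARE victim: VM lock step skipped */
	return victim->vm->resv.held;	/* VM resv taken earlier by the fault */
}

/* Fault on `faulting` (step 1), then reclaim tries to evict `victim`. */
static bool model_fault_then_evict(struct model_obj *faulting,
				   const struct model_obj *victim)
{
	bool blocked;

	faulting->resv->held = true;	/* fault path locks obj->resv */
	blocked = model_evict_would_block(victim);
	faulting->resv->held = false;
	return blocked;
}
```

With one VM, a NO_SHARE object as the faulting object and a non-NO_SHARE
object as the victim, model_fault_then_evict() reports that the eviction
would block. With the NO_SHARE object as the victim, the skip applies and
it does not.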

Dmesg traces
------------
[ 6390.715487] INFO: task zed-editor:27479 is blocked on a mutex likely owned by task zed-editor:27479.
[ 6513.594800] task:zed-editor state:D stack:0 pid:27479
[ 6513.594830] __ww_mutex_lock_slowpath+0x20/0x48
[ 6513.594835] ww_mutex_lock+0xe8/0x178
[ 6513.594839] with_vm_locks+0x78/0x1a8 [msm]
[ 6513.594884] evict+0x70/0xc8 [msm]
[ 6513.594927] msm_gem_shrinker_scan+0x1a4/0x538 [msm]
[ 6513.594975] do_shrink_slab+0x164/0x640
[ 6513.594988] shrink_one+0x9c/0x1d8
[ 6513.595004] do_try_to_free_pages+0xe0/0x5a0
[ 6513.595015] __alloc_pages_slowpath.constprop.0+0x25c/0xb18
[ 6513.595048] sg_kmalloc+0x44/0x60
[ 6513.595054] sg_alloc_append_table_from_pages+0x238/0x480
[ 6513.595065] drm_prime_pages_to_sg+0xac/0x150
[ 6513.595113] msm_gem_fault+0x48/0x170 [msm]
[ 6513.595155] __do_fault+0x44/0x200
[ 6513.595179] do_page_fault+0x1fc/0x870

System memory pressure (at time of deadlock)
---------------------------------------------
CMA total: 32768 pages (128 MB)
CMA free: 810 pages (3.2 MB)
allocstall (normal): 519
allocstall (movable): 3387
compact_fail: 7151
compact_success: 886
direct reclaim pages scanned: 1,194,389
Shmem: 2.5 GB (likely WASM memfds)
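
For reference, the page counts above can be converted with a few lines of
C. This is only a sketch of the arithmetic: it assumes 4 KiB pages (which
is what 32768 pages = 128 MB implies), and the helper names are made up
for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Convert a page count to MiB. */
static double pages_to_mib(uint64_t pages)
{
	return (double)(pages * MODEL_PAGE_SIZE) / (1024.0 * 1024.0);
}

/* Percentage of `part` in `whole`; 0 if `whole` is 0. */
static double percent(uint64_t part, uint64_t whole)
{
	return whole ? 100.0 * (double)part / (double)whole : 0.0;
}
```

pages_to_mib(810) gives about 3.16 MiB free out of pages_to_mib(32768) =
128 MiB, i.e. percent(810, 32768) is roughly 2.5% of the CMA area. Treating
compact_success + compact_fail as the number of attempts,
percent(886, 886 + 7151) is roughly 11%.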

Root cause
----------
The unconditional GFP_KERNEL in drm_prime_pages_to_sg() (via
sg_alloc_table_from_pages_segment()) permits direct reclaim while the
calling thread holds the VM's dma_resv. Under memory pressure, the
shrinker re-enters the dma_resv lock path on the same VM and the thread
deadlocks against itself.

Proposed fix
------------
Suppress direct reclaim during the fault path's page allocation by
wrapping get_pages() with memalloc_noreclaim_save()/restore():

--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -10,6 +10,7 @@
 #include <linux/spinlock.h>
 #include <linux/shmem_fs.h>
 #include <linux/dma-buf.h>
+#include <linux/sched/mm.h>
 #include <drm/drm_dumb_buffers.h>
 #include <drm/drm_prime.h>
@@ -282,6 +283,7 @@ static vm_fault_t msm_gem_fault(struct vm_fault *vmf)
 	struct page **pages;
 	unsigned long pfn;
 	pgoff_t pgoff;
+	unsigned int noreclaim_flag;
 	int err;
 	vm_fault_t ret;
@@ -300,8 +302,13 @@ static vm_fault_t msm_gem_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 	}
+	/* Disable direct reclaim while holding GEM/VM locks to avoid
+	 * self-deadlock in the shrinker (drm_prime_pages_to_sg uses
+	 * GFP_KERNEL unconditionally).
+	 */
+	noreclaim_flag = memalloc_noreclaim_save();
 	pages = get_pages(obj);
+	memalloc_noreclaim_restore(noreclaim_flag);
 	if (IS_ERR(pages)) {
 		ret = vmf_error(PTR_ERR(pages));

This follows the kernel's scoped memalloc_*_save()/restore() pattern for
restricting reclaim in callees that hardcode GFP_KERNEL, here applied
while the dma_resv lock is held.
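
The save/restore pairing matters because such scopes can nest. In the
kernel, memalloc_noreclaim_save() sets PF_MEMALLOC on the current task
and returns the previous state for the matching restore. Below is a
userspace C sketch of those semantics only. The model_* names and the
flag value are hypothetical, and a global variable stands in for
current->flags.

```c
#include <assert.h>

#define MODEL_PF_MEMALLOC 0x1u	/* illustrative bit, not the kernel's value */

static unsigned int model_task_flags;	/* stands in for current->flags */

/* Enter a no-reclaim scope; return the previous state of the flag. */
static unsigned int model_noreclaim_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC;

	model_task_flags |= MODEL_PF_MEMALLOC;
	return old;
}

/* Leave the scope by putting the flag back to its saved state. */
static void model_noreclaim_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC) | old;
}

/*
 * An allocator that hardcodes its GFP flags still sees the task flag,
 * which is how the scope reaches callees that cannot be passed a mask.
 */
static int model_alloc_may_direct_reclaim(void)
{
	return !(model_task_flags & MODEL_PF_MEMALLOC);
}
```

An inner save/restore pair leaves an outer scope intact, because the inner
restore puts back the "already set" state it saved. Since PF_MEMALLOC also
lets the task dip into memory reserves, it seems worth keeping the scope
as narrow as in the patch above.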

Notes
-----
- Bug report assisted by Kimi K2.6 (language model).
- Earlier hung-task traces (before timestamp 6390) were not captured;
  the ring buffer may have wrapped.