On 2026-02-25 21:06:05 [+0100], Maarten Lankhorst wrote:
> Hey,
Hi,
> After realizing the uncore lock only needed to be converted to a raw spinlock
> because the testcase was broken, I tested the alternative fix of using
> sleeping context only in the selftests:
> https://patchwork.freedesktop.org/patch/707063/?series=162145&rev=1
I have some memory that making uncore raw "solved" the tracing problem
but the latencies jumped at the same time.
Since you tested XE only: xe does not show the lockup. It spills this
however:
|[ T632] ============================================
|[ T632] WARNING: possible recursive locking detected
|[ T632] 7.0.0-rc1-plain+ #5 Tainted: G U E
|[ T632] --------------------------------------------
|[ T632] kworker/u32:13/632 is trying to acquire lock:
|[ T632] ffff8efb858e7c58 (&fence->inline_lock){+.+.}-{2:2}, at: dma_fence_add_callback+0x4b/0x100
|[ T632]
|[ T632] but task is already holding lock:
|[ T632] ffff8efb524a1b58 (&fence->inline_lock){+.+.}-{2:2}, at: dma_fence_add_callback+0x4b/0x100
|[ T632]
|[ T632] other info that might help us debug this:
|[ T632] Possible unsafe locking scenario:
|[ T632] CPU0
|[ T632] ----
|[ T632] lock(&fence->inline_lock);
|[ T632] lock(&fence->inline_lock);
|[ T632]
|[ T632] *** DEADLOCK ***
|[ T632] May be due to missing lock nesting notation
|[ T632] 5 locks held by kworker/u32:13/632:
|[ T632] #0: ffffffffc0ba3540 (drm_sched_lockdep_map){+.+.}-{0:0}, at: process_one_work+0x57a/0x600
|[ T632] #1: ffffcf7f020f7e48 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x600
|[ T632] #2: ffff8efb524a1b58 (&fence->inline_lock){+.+.}-{2:2}, at: dma_fence_add_callback+0x4b/0x100
|[ T632] #3: ffffffffb60fecc0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0xe6/0x1d0
|[ T632] #4: ffffffffb60fecc0 (rcu_read_lock){....}-{1:2}, at: __dma_fence_enable_signaling+0x59/0x320
|[ T632]
|[ T632] stack backtrace:
|[ T632] CPU: 6 UID: 0 PID: 632 Comm: kworker/u32:13 Tainted: G U E 7.0.0-rc1-plain+ #5 PREEMPT_{RT,(lazy)}
|[ T632] Tainted: [U]=USER, [E]=UNSIGNED_MODULE
|[ T632] Hardware name: LENOVO 20TD00GLGE/20TD00GLGE, BIOS R1EET64W(1.64 ) 03/18/2025
|[ T632] Workqueue: drm_sched_run_job_work [gpu_sched]
|[ T632] Call Trace:
|[ T632] <TASK>
|[ T632] dump_stack_lvl+0x6e/0xa0
|[ T632] print_deadlock_bug.cold+0xc0/0xcd
|[ T632] __lock_acquire+0x1232/0x2180
|[ T632] lock_acquire+0xca/0x2f0
|[ T632] rt_spin_lock+0x3f/0x1d0
|[ T632] dma_fence_add_callback+0x4b/0x100
|[ T632] dma_fence_chain_enable_signaling+0x11e/0x280
|[ T632] __dma_fence_enable_signaling+0xc8/0x320
|[ T632] dma_fence_add_callback+0x53/0x100
|[ T632] drm_sched_entity_pop_job+0xf5/0x550 [gpu_sched]
|[ T632] drm_sched_run_job_work+0x136/0x470 [gpu_sched]
|[ T632] process_one_work+0x21d/0x600
|[ T632] worker_thread+0x1d9/0x3b0
|[ T632] kthread+0xf4/0x130
|[ T632] ret_from_fork+0x3a5/0x430
|[ T632] ret_from_fork_asm+0x1a/0x30
|[ T632] </TASK>
Nothing else so far.
> With that the reset selftest works as expected.
>
> But I do see some weird lockdep splats and aborts after that fixed the uncore
> lock testcases:
> https://patchwork.freedesktop.org/series/162145/
>
> I believe it could be a different instance of:
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_162145v1/bat-dg2-9/igt@[email protected]#dmesg-warnings904
>
> Which is tracked under:
>
> https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/15759
>
> Perhaps those are related to what you are seeing?
Not sure if it was the uncore in both cases. If you have an updated
series somewhere, I could pull and check. In the meantime I will look at
what causes the lockup on i915.
> Also don't use that series for anything but CI results, I rather want to
> submit
> a new version of this series.
So I was brave to use it on my actual HW then ;)
> Kind regards,
> ~Maarten Lankhorst
Sebastian