On 2026-02-17 14:16, Michele Palazzi wrote:
> Intermittent flip_done timeouts have been observed on AMD GPUs
> since kernel 6.12.
>
> Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
> incorrectly consume events meant for plane flips during cursor-only
> updates. This happens because cursor commits defer event delivery to
> the vblank handler, which checks (pflip_status != SUBMITTED). Since
> AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
> event slot for subsequent plane flips, leading to timeouts.
>
> The potential for a race was present since commit 473683a03495
> ("drm/amd/display: Create a file dedicated for CRTC"), then
> commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
> policy for older ASICs") made it happen by reducing vblank
> off-delay and making disables happen much more frequently
> between commits.
>
> Fix this by sending cursor-only vblank events immediately in
> amdgpu_dm_commit_planes(). Since cursor updates are committed to
> hardware immediately, deferring the event is unnecessary and
> creates race windows for event stealing or starvation if vblank
> is disabled before the handler runs.
>
> Tested on DCN 2.1, 3.2, and 3.5.
>
> Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
> Signed-off-by: Michele Palazzi <[email protected]>
> ---
> I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE
> first, 9070 XT now) since kernel 6.12. The hang occurs during normal
> desktop usage but is much easier to trigger under specific conditions
> involving cursor movements and plane updates.
>
> Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787
>
> Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
> Dual DP monitors, 2560x1440, 144Hz
> Desktop: KDE Plasma Wayland
>
> The hang was initially observed while using Cisco Webex
> (XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U). To reproduce, start
> a meeting and screen share a window running the Omnissa Horizon client, then
> move the cursor around between the two monitors and the shared window.
> Under these conditions the hang usually occurs within a few hours.
>
> Enabling drm.debug masks the issue entirely: the overhead
> changes timing enough to close the race window.
> So I added debug printks to amdgpu_dm.c and used a small bpftrace script to
> log the pageflip lifecycle with per-thread tracking.
>
> bpftrace script:
>
> config = { missing_probes = "warn" }
> BEGIN { printf("=== flip_done tracer started ===\n"); }
> kprobe:drm_crtc_vblank_on_config { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
> kprobe:drm_vblank_disable_and_save { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
> kprobe:dm_pflip_high_irq { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
> kprobe:drm_crtc_send_vblank_event { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
> kprobe:drm_vblank_put { printf("%lu drm_vblank_put\n", nsecs/1000000); }
> kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
> kprobe:manage_dm_interrupts { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
> kprobe:drm_atomic_helper_wait_for_flip_done {
>     @wait_start[tid] = nsecs;
>     printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
> }
> kretprobe:drm_atomic_helper_wait_for_flip_done {
>     $start = @wait_start[tid];
>     $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
>     if ($ms > 100) {
>         printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n", nsecs/1000000, $ms, tid);
>     } else {
>         printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n", nsecs/1000000, $ms, tid);
>     }
>     delete(@wait_start[tid]);
> }
> interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
> END { printf("=== stopped ===\n"); clear(@wait_start); }
>
> The timeout was captured at 17:35:41 CET. The trace timestamps
> match dmesg exactly (9942110ms = dmesg 9942.110s).
>
> dmesg output from the timeout:
>
> [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
> [ 9942.110380] [FLIP_DEBUG] crtc:0 pflip_status=0 event=00000000a0636a23
> vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
> disable_immediate=0 active_planes=1
>
> pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip was
> never completed, but the status was already reset to NONE. vblank was
> enabled and a refcount was held, so vblank IRQs were firing throughout the
> wait.
>
> The bpftrace captured the exact sequence leading up to the hang. Here's the
> critical timeline at ~17:35:31 (9931771), about 10 seconds before the
> timeout fired:
>
> 9931755 drm_atomic_helper_commit_hw_done
> 9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> 9931756 dm_pflip_high_irq  <- normal plane flip, last good one
> 9931756 drm_crtc_send_vblank_event
> 9931756 drm_vblank_put
> 9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
> 9931771 drm_vblank_disable_and_save  <- vblank timer fires
> 9931771 drm_crtc_send_vblank_event  <- event sent WITHOUT dm_pflip_high_irq
> 9931771 drm_vblank_put
> 9931771 drm_atomic_helper_commit_hw_done
> 9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> 9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
> 9931773 drm_atomic_helper_commit_hw_done
> 9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]  <- new commit
> 9931777 dm_pflip_high_irq  <- pflip fires, completes the wrong one
> 9931777 drm_crtc_send_vblank_event
> 9931777 drm_vblank_put
> 9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
> 9931781 drm_atomic_helper_commit_hw_done
> 9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]  <- THIS ONE HANGS
> ... 10328ms of silence ...
> 9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]
>
> The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq.
> This is amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The
> problem is that the cursor-only commit path in amdgpu_dm_commit_planes()
> stores the event in acrtc->event and defers delivery to the vblank handler.
> This creates two race conditions:
>
> - The vblank handler checks (pflip_status != SUBMITTED) which also
> matches NONE, so it can consume events meant for plane flips. The subsequent
> dm_pflip_high_irq finds no event, and the next commit hangs.
>
> - If vblank is disabled by the off-delay timer before the handler
> runs, the PENDING cursor event is never delivered and the commit hangs.
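
To make the first race concrete, here is a minimal user-space sketch of the
check described above. It is only a model, not the driver code: the
model_* names and the struct are made up for illustration, and the event is
represented by a plain string pointer. It shows that a "!= SUBMITTED" test is
true for both NONE and PENDING, so the vblank path will consume whatever
event happens to be stored at that moment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative states only; the driver's real enum is amdgpu_flip_status. */
enum model_flip_status {
    MODEL_FLIP_NONE,
    MODEL_FLIP_PENDING,
    MODEL_FLIP_SUBMITTED
};

/* Hypothetical stand-in for the per-CRTC state discussed in the thread. */
struct model_crtc {
    enum model_flip_status pflip_status;
    const char *event; /* stands in for acrtc->event */
};

/* The predicate as described: anything other than SUBMITTED passes. */
static bool model_vblank_may_send(enum model_flip_status status)
{
    return status != MODEL_FLIP_SUBMITTED;
}

/*
 * Model of the vblank handler: if an event is stored and the predicate
 * passes, consume it. Returns the consumed event, or NULL if none was sent.
 */
static const char *model_handle_vblank(struct model_crtc *crtc)
{
    assert(crtc != NULL);
    if (crtc->event == NULL || !model_vblank_may_send(crtc->pflip_status))
        return NULL;

    const char *sent = crtc->event;
    crtc->event = NULL;
    return sent;
}
```

With pflip_status at NONE and an event stored (the state seen in the dmesg
dump above), model_handle_vblank() consumes the event; with SUBMITTED it
leaves the event for the pflip interrupt.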
>
> The fix is to send cursor-only events immediately via
> drm_crtc_send_vblank_event() in amdgpu_dm_commit_planes() instead of
> deferring to the vblank handler. The cursor update is already committed to
> hardware at this point, so immediate delivery is correct. This eliminates
> both race conditions by removing cursor events from the deferred delivery
> path entirely:
>
> - Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
> - Cursor updates: sent immediately in commit_planes (no deferral, no races)
>
> From git history, the check in amdgpu_dm_crtc_handle_vblank() has been like
> this since 473683a03495 ("drm/amd/display: Create a file dedicated for
> CRTC", 2022), which moved this code from amdgpu_dm.c, but it was practically
> impossible to trigger because the default drm_vblank_offdelay was 5000ms.
> Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy
> for older ASICs") in 6.12 changed all ASICs to use
> drm_crtc_vblank_on_config() with a computed off-delay of roughly 2 frames
> (~14ms at 144Hz). This made drm_vblank_disable_and_save fire hundreds of
> times more often, turning a theoretical race into reality. The bpftrace log
> is full of drm_vblank_disable_and_save events interleaved with the commit
> sequence.
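
The "~14ms" and "hundreds of times" figures can be checked with a little
integer arithmetic. The helper names below are made up for this sketch, and
it assumes the off-delay is exactly two frame periods, which the description
above only states approximately.

```c
#include <assert.h>

/* Frame period in microseconds for a refresh rate in Hz (integer division). */
static unsigned long model_frame_period_us(unsigned int refresh_hz)
{
    assert(refresh_hz > 0);
    return 1000000UL / refresh_hz;
}

/* Off-delay in microseconds, assuming it spans a whole number of frames. */
static unsigned long model_offdelay_us(unsigned int refresh_hz,
                                       unsigned int frames)
{
    return (unsigned long)frames * model_frame_period_us(refresh_hz);
}
```

At 144Hz the frame period comes out at 6944us, so two frames are 13888us
(about 13.9ms). Dividing the old 5000ms default by that gives roughly 360,
which matches "hundreds of times more often".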
>
> This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5
> (9070 XT). Under a high-frequency glxgears + cursor jiggling test, the patch
> successfully intercepted the race thousands of times without a single
> timeout. I am also running it on my main system without issues.
>
> My earlier, rushed attempt at this,
> https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html,
> is superseded by this patch and is no longer needed.
>
> Patch applies cleanly on top of tag v6.19.
Really nice debugging work, thanks for catching this!
Ideally, the cursor event should be delivered when hardware latches onto the new
cursor info and starts scanning it out. The latching event fires an interrupt
that should be handled by dm_crtc_high_irq().
dm_pflip_high_irq() handles an interrupt specifically for when hardware latches
onto a new fb address; I don't think it actually fires when there's a
cursor-only update. I think if we really want to do it right, we can have
another "acrtc_attach->cursor_event" just for cursor-only updates, and deliver
the event in dm_crtc_high_irq().
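
As a rough user-space sketch of that idea (not driver code: every model_*
name is hypothetical, and events are plain string pointers), keeping two
slots means neither interrupt path can take the other's event:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative states only, not the driver's enum. */
enum model_flip_status { MODEL_FLIP_NONE, MODEL_FLIP_SUBMITTED };

/* Hypothetical per-CRTC state with one slot per event type. */
struct model_crtc {
    enum model_flip_status pflip_status;
    const char *event;        /* plane-flip event, owned by the pflip path */
    const char *cursor_event; /* assumed extra slot for cursor-only commits */
};

/* Model of the vblank/latch interrupt: it only ever touches cursor_event. */
static const char *model_crtc_high_irq(struct model_crtc *crtc)
{
    assert(crtc != NULL);
    const char *sent = crtc->cursor_event;
    crtc->cursor_event = NULL;
    return sent;
}

/* Model of the pflip interrupt: it only completes a SUBMITTED flip. */
static const char *model_pflip_high_irq(struct model_crtc *crtc)
{
    assert(crtc != NULL);
    if (crtc->pflip_status != MODEL_FLIP_SUBMITTED)
        return NULL;

    const char *sent = crtc->event;
    crtc->event = NULL;
    crtc->pflip_status = MODEL_FLIP_NONE;
    return sent;
}
```

In this model the vblank path needs no pflip_status check at all, since it
never looks at the plane-flip slot.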
In any case, I don't foresee any major issues with delivering the event early.
And since it fixes an ongoing issue:
Reviewed-by: Leo Li <[email protected]>
Thanks!
Leo
>
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index a8a59126b2d2..35987ce80c71 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct
> drm_atomic_state *state,
> } else if (cursor_update && acrtc_state->active_planes > 0) {
> spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
> if (acrtc_attach->base.state->event) {
> - drm_crtc_vblank_get(pcrtc);
> - acrtc_attach->event = acrtc_attach->base.state->event;
> + drm_crtc_send_vblank_event(pcrtc,
> acrtc_attach->base.state->event);
> acrtc_attach->base.state->event = NULL;
> }
> spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);