On Mon, Feb 23, 2026 at 11:04 AM Leo Li <[email protected]> wrote:
>
>
>
> On 2026-02-17 14:16, Michele Palazzi wrote:
> > Intermittent flip_done timeouts have been observed on AMD GPUs
> > since kernel 6.12.
> >
> > Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
> > incorrectly consume events meant for plane flips during cursor-only
> > updates. This happens because cursor commits defer event delivery to
> > the vblank handler, which checks (pflip_status != SUBMITTED). Since
> > AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
> > event slot for subsequent plane flips, leading to timeouts.
> >
> > The potential for a race was present since commit 473683a03495
> > ("drm/amd/display: Create a file dedicated for CRTC"), then
> > commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
> > policy for older ASICs") made it happen by reducing vblank
> > off-delay and making disables happen much more frequently
> > between commits.
> >
> > Fix this by sending cursor-only vblank events immediately in
> > amdgpu_dm_commit_planes(). Since cursor updates are committed to
> > hardware immediately, deferring the event is unnecessary and
> > creates race windows for event stealing or starvation if vblank
> > is disabled before the handler runs.
> >
> > Tested on DCN 2.1, 3.2, and 3.5.
> >
> > Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
> > Signed-off-by: Michele Palazzi <[email protected]>
> > ---
> > I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE
> > first, 9070 XT now) since kernel 6.12. The hang occurs during normal
> > desktop usage but is much easier to trigger under specific conditions
> > involving cursor movements and plane updates.
> >
> > Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787
> >
> > Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
> > Dual DP monitors, 2560x1440, 144Hz
> > Desktop: KDE Plasma Wayland
> >
> > The hang was initially observed while using Cisco Webex
> > (XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U). To reproduce,
> > start a meeting and screen-share a window running the Omnissa Horizon
> > client, then move the cursor around between the two monitors and the
> > shared window. Under these conditions the hang usually occurs within a
> > few hours.
> >
> > Enabling drm.debug masks the issue entirely: the overhead
> > changes timing enough to close the race window.
> > So I added debug printks to amdgpu_dm.c and used a small bpftrace script
> > to log the pageflip lifecycle with per-thread tracking.
> >
> > bpftrace script:
> >
> > config = { missing_probes = "warn" }
> > BEGIN { printf("=== flip_done tracer started ===\n"); }
> > kprobe:drm_crtc_vblank_on_config { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
> > kprobe:drm_vblank_disable_and_save { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
> > kprobe:dm_pflip_high_irq { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
> > kprobe:drm_crtc_send_vblank_event { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
> > kprobe:drm_vblank_put { printf("%lu drm_vblank_put\n", nsecs/1000000); }
> > kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
> > kprobe:manage_dm_interrupts { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
> > kprobe:drm_atomic_helper_wait_for_flip_done {
> >   @wait_start[tid] = nsecs;
> >   printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n",
> >          nsecs/1000000, tid);
> > }
> > kretprobe:drm_atomic_helper_wait_for_flip_done {
> >   $start = @wait_start[tid];
> >   $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
> >   if ($ms > 100) {
> >     printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
> >            nsecs/1000000, $ms, tid);
> >   } else {
> >     printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
> >            nsecs/1000000, $ms, tid);
> >   }
> >   delete(@wait_start[tid]);
> > }
> > interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
> > END { printf("=== stopped ===\n"); clear(@wait_start); }
> >
> > The timeout was captured at 17:35:41 CET. The trace timestamps
> > match dmesg exactly (9942110ms = dmesg 9942.110s).
> >
> > dmesg output from the timeout:
> >
> > [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
> > [ 9942.110380] [FLIP_DEBUG] crtc:0 pflip_status=0 event=00000000a0636a23 vbl_enabled=1 vbl_refcount=1 vbl_count=1428659 disable_immediate=0 active_planes=1
> >
> > pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip
> > was never completed but the status was already reset to NONE. vblank was
> > enabled and its refcount was held, so vblank IRQs were firing throughout
> > the wait.
> >
> > The bpftrace script captured the exact sequence leading up to the hang.
> > Here's the critical timeline at ~17:35:31 (9931771), about 10 seconds
> > before the timeout fired:
> >
> > 9931755 drm_atomic_helper_commit_hw_done
> > 9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> > 9931756 dm_pflip_high_irq  <- normal plane flip, last good one
> > 9931756 drm_crtc_send_vblank_event
> > 9931756 drm_vblank_put
> > 9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
> > 9931771 drm_vblank_disable_and_save  <- vblank timer fires
> > 9931771 drm_crtc_send_vblank_event  <- event sent WITHOUT dm_pflip_high_irq
> > 9931771 drm_vblank_put
> > 9931771 drm_atomic_helper_commit_hw_done
> > 9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> > 9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
> > 9931773 drm_atomic_helper_commit_hw_done
> > 9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]  <- new commit
> > 9931777 dm_pflip_high_irq  <- pflip fires, completes the wrong one
> > 9931777 drm_crtc_send_vblank_event
> > 9931777 drm_vblank_put
> > 9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
> > 9931781 drm_atomic_helper_commit_hw_done
> > 9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]  <- THIS ONE HANGS
> > ... 10328ms of silence ...
> > 9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]
> >
> > The drm_crtc_send_vblank_event at 9931771 fires without dm_pflip_high_irq.
> > This is amdgpu_dm_crtc_handle_vblank() sending a cursor-only event. The
> > problem is that the cursor-only commit path in amdgpu_dm_commit_planes()
> > stores the event in acrtc->event and defers delivery to the vblank
> > handler. This creates two race conditions:
> >
> > - The vblank handler checks (pflip_status != SUBMITTED), which also
> >   matches NONE, so it can consume events meant for plane flips. The
> >   subsequent dm_pflip_high_irq finds no event, and the next commit hangs.
> >
> > - If vblank is disabled by the off-delay timer before the handler
> >   runs, the PENDING cursor event is never delivered and the commit hangs.
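
For readers following along, the first bullet can be illustrated with a small userspace toy model. This is only a sketch of the predicate as described in the quoted text, not the driver code: `toy_crtc`, `vblank_handler_delivers()` and `toy_handle_vblank()` are hypothetical names, and the enum values other than NONE = 0 (which matches the dmesg line above) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names mirror the thread. NONE = 0 matches the dmesg output;
 * the values of PENDING and SUBMITTED are assumed for this toy. */
enum flip_status { FLIP_NONE = 0, FLIP_PENDING = 1, FLIP_SUBMITTED = 2 };

/* Hypothetical stand-in for the per-CRTC state discussed above. */
struct toy_crtc {
    enum flip_status pflip_status;
    const void *event; /* stand-in for acrtc->event; NULL means empty */
};

/* The check as described: any stored event is delivered unless the
 * status is SUBMITTED, so both NONE and PENDING pass. */
static bool vblank_handler_delivers(const struct toy_crtc *c)
{
    return c->event != NULL && c->pflip_status != FLIP_SUBMITTED;
}

/* Toy vblank handler: returns the delivered event (or NULL) and
 * empties the slot, so a later consumer finds nothing. */
static const void *toy_handle_vblank(struct toy_crtc *c)
{
    const void *e = NULL;

    if (vblank_handler_delivers(c)) {
        e = c->event;
        c->event = NULL;
    }
    return e;
}
```

In this model, an event sitting in the slot while the status is NONE is consumed by the vblank handler, and whoever looks at the slot afterwards finds it empty. How the real driver reaches that state is not modelled here.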
> >
> > The fix is to send cursor-only events immediately via
> > drm_crtc_send_vblank_event() in amdgpu_dm_commit_planes() instead of
> > deferring to the vblank handler. The cursor update is already committed
> > to hardware at this point, so immediate delivery is correct.
> > This eliminates both race conditions by removing cursor events from the
> > deferred delivery path entirely:
> >
> > - Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
> > - Cursor updates: sent immediately in commit_planes (no deferral, no races)
> >
> > From git history, the check in amdgpu_dm_crtc_handle_vblank() has been
> > like this since 473683a03495 ("drm/amd/display: Create a file dedicated
> > for CRTC", 2022), which moved this code from amdgpu_dm.c, but it was
> > practically impossible to trigger because the default drm_vblank_offdelay
> > was 5000ms.
> > Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy
> > for older ASICs") in 6.12 changed all ASICs to use
> > drm_crtc_vblank_on_config() with a computed off-delay of roughly 2 frames
> > (~14ms at 144Hz).
> > This made drm_vblank_disable_and_save fire hundreds of times more often,
> > turning a theoretical race into reality. The bpftrace log is full of
> > drm_vblank_disable_and_save events interleaved with the commit sequence.
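
As a quick sanity check of the numbers quoted above, the arithmetic can be sketched as follows. The helper `offdelay_us()` is hypothetical and only reproduces the "2 frames at 144Hz" figure from the text. It does not reproduce how the driver actually computes the off-delay.

```c
#include <assert.h>

/* Hypothetical helper: duration of `frames` refresh periods at
 * `refresh_hz`, in microseconds (integer division, rounds down). */
static unsigned long offdelay_us(unsigned int frames, unsigned int refresh_hz)
{
    assert(refresh_hz > 0);
    return (unsigned long)frames * 1000000UL / refresh_hz;
}

/* How many times shorter the new off-delay is than an old one given
 * in milliseconds (5000 ms is the default quoted in the thread). */
static unsigned long offdelay_ratio(unsigned long old_ms,
                                    unsigned int frames,
                                    unsigned int refresh_hz)
{
    return old_ms * 1000UL / offdelay_us(frames, refresh_hz);
}
```

With these assumptions, `offdelay_us(2, 144)` gives 13888 us, i.e. about 14 ms, and `offdelay_ratio(5000, 2, 144)` gives 360, which is in line with the "hundreds of times more often" estimate.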
> >
> > This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5
> > (9070 XT).
> > Under a high-frequency glxgears + cursor jiggling test, the patch
> > successfully intercepted the race thousands of times without a single
> > timeout.
> > I am also running this on my main system without issues.
> >
> > My earlier, rushed attempt at addressing this,
> > https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html
> > is no longer needed.
> >
> > Patch applies cleanly on top of tag v6.19.
>
> Really nice debugging work, thanks for catching this!
>
> Ideally, the cursor event should be delivered when hardware latches onto
> the new cursor info and starts scanning it out. The latching event fires
> an interrupt that should be handled by dm_crtc_high_irq().
>
> dm_pflip_high_irq() handles an interrupt specifically for when hardware
> latches onto a new fb address; I don't think it actually fires when
> there's a cursor-only update. I think if we really want to do it right, we
> can have another "acrtc_attach->cursor_event" just for cursor-only
> updates, and deliver the event in dm_crtc_high_irq().
>
> In any case, I don't foresee any major issues with delivering the event early.
> And since it fixes an ongoing issue:
>
> Reviewed-by: Leo Li <[email protected]>
Leo, I assume you are planning to push this?
Alex
>
> Thanks!
> Leo
>
> >
> > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index a8a59126b2d2..35987ce80c71 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
> >  	} else if (cursor_update && acrtc_state->active_planes > 0) {
> >  		spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
> >  		if (acrtc_attach->base.state->event) {
> > -			drm_crtc_vblank_get(pcrtc);
> > -			acrtc_attach->event = acrtc_attach->base.state->event;
> > +			drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
> >  			acrtc_attach->base.state->event = NULL;
> >  		}
> >  		spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
>