On Mon, Feb 23, 2026 at 11:04 AM Leo Li <[email protected]> wrote:
>
>
>
> On 2026-02-17 14:16, Michele Palazzi wrote:
> > Intermittent flip_done timeouts have been observed on AMD GPUs
> > since kernel 6.12.
> >
> > Analysis with bpftrace reveals that amdgpu_dm_crtc_handle_vblank() can
> > incorrectly consume events meant for plane flips during cursor-only
> > updates. This happens because cursor commits defer event delivery to
> > the vblank handler, which checks (pflip_status != SUBMITTED). Since
> > AMDGPU_FLIP_NONE also matches this, cursor events can "steal" the
> > event slot for subsequent plane flips, leading to timeouts.
> >
> > The potential for a race was present since commit 473683a03495
> > ("drm/amd/display: Create a file dedicated for CRTC"), then
> > commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
> > policy for older ASICs") made it happen by reducing vblank
> > off-delay and making disables happen much more frequently
> > between commits.
> >
> > Fix this by sending cursor-only vblank events immediately in
> > amdgpu_dm_commit_planes(). Since cursor updates are committed to
> > hardware immediately, deferring the event is unnecessary and
> > creates race windows for event stealing or starvation if vblank
> > is disabled before the handler runs.
> >
> > Tested on DCN 2.1, 3.2, and 3.5.
> >
> > Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs")
> > Signed-off-by: Michele Palazzi <[email protected]>
> > ---
> > I've been chasing intermittent flip_done timeouts on AMD GPUs (7900 GRE
> > first, 9070 XT now) since kernel 6.12. The hang occurs during normal
> > desktop usage but is much easier to trigger under specific conditions
> > involving cursor movements and plane updates.
> >
> > Partially tracked in https://gitlab.freedesktop.org/drm/amd/-/issues/3787
> >
> > Hardware: Ryzen 7 7800X3D, Radeon RX 9070 XT
> > Dual DP monitors, 2560x1440, 144Hz
> > Desktop: KDE Plasma Wayland
> >
> > The hang was initially observed while using Cisco Webex
> > (XDG_SESSION_TYPE=x11 /opt/Webex/bin/CiscoCollabHost %U). To reproduce,
> > start a meeting and screen share a window running the Omnissa Horizon
> > client, then move the cursor around between the two monitors and the
> > shared window.
> > Under these conditions the hang usually occurs within a few hours.
> >
> > Enabling drm.debug masks the issue entirely: the overhead
> > changes timing enough to close the race window.
> > So I added debug printks to amdgpu_dm.c and used a small bpftrace script
> > to log the pageflip lifecycle with per-thread tracking.
> >
> > bpftrace script:
> >
> >   config = { missing_probes = "warn" }
> >   BEGIN { printf("=== flip_done tracer started ===\n"); }
> >   kprobe:drm_crtc_vblank_on_config       { printf("%lu drm_crtc_vblank_on_config\n", nsecs/1000000); }
> >   kprobe:drm_vblank_disable_and_save     { printf("%lu drm_vblank_disable_and_save\n", nsecs/1000000); }
> >   kprobe:dm_pflip_high_irq               { printf("%lu dm_pflip_high_irq\n", nsecs/1000000); }
> >   kprobe:drm_crtc_send_vblank_event      { printf("%lu drm_crtc_send_vblank_event\n", nsecs/1000000); }
> >   kprobe:drm_vblank_put                  { printf("%lu drm_vblank_put\n", nsecs/1000000); }
> >   kprobe:drm_atomic_helper_commit_hw_done { printf("%lu drm_atomic_helper_commit_hw_done\n", nsecs/1000000); }
> >   kprobe:manage_dm_interrupts            { printf("%lu manage_dm_interrupts\n", nsecs/1000000); }
> >   kprobe:drm_atomic_helper_wait_for_flip_done {
> >       @wait_start[tid] = nsecs;
> >       printf("%lu drm_atomic_helper_wait_for_flip_done ENTER [tid=%d]\n", nsecs/1000000, tid);
> >   }
> >   kretprobe:drm_atomic_helper_wait_for_flip_done {
> >       $start = @wait_start[tid];
> >       $ms = $start > 0 ? (nsecs - $start) / 1000000 : 0;
> >       if ($ms > 100) {
> >           printf("%lu drm_atomic_helper_wait_for_flip_done TIMEOUT waited %lums [tid=%d]\n",
> >                  nsecs/1000000, $ms, tid);
> >       } else {
> >           printf("%lu drm_atomic_helper_wait_for_flip_done EXIT %lums [tid=%d]\n",
> >                  nsecs/1000000, $ms, tid);
> >       }
> >       delete(@wait_start[tid]);
> >   }
> >   interval:s:60 { printf("%lu HEARTBEAT\n", nsecs/1000000); }
> >   END { printf("=== stopped ===\n"); clear(@wait_start); }
> >
> > The timeout was captured at 17:35:41 CET. The trace timestamps
> > match dmesg exactly (9942110ms = dmesg 9942.110s).
> >
> > dmesg output from the timeout:
> >
> >   [ 9942.110360] [FLIP_DEBUG] wait_for_flip_done took 10329ms!
> >   [ 9942.110380] [FLIP_DEBUG]  crtc:0 pflip_status=0 event=00000000a0636a23
> >                   vbl_enabled=1 vbl_refcount=1 vbl_count=1428659
> >                   disable_immediate=0 active_planes=1
> >
> > pflip_status=0 (AMDGPU_FLIP_NONE) but event is still non-NULL. The flip
> > was never completed but the status was already reset to NONE. vblank was
> > enabled, refcount was held, so vblank IRQs were firing throughout the
> > wait.
> >
> > The bpftrace captured the exact sequence leading up to the hang. Here's
> > the critical timeline at ~17:35:31 (9931771), about 10 seconds before the
> > timeout fired:
> >
> >   9931755 drm_atomic_helper_commit_hw_done
> >   9931755 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> >   9931756 dm_pflip_high_irq                           <- normal plane flip, last good one
> >   9931756 drm_crtc_send_vblank_event
> >   9931756 drm_vblank_put
> >   9931756 drm_atomic_helper_wait_for_flip_done EXIT 1ms [tid=35929]
> >   9931771 drm_vblank_disable_and_save                 <- vblank timer fires
> >   9931771 drm_crtc_send_vblank_event                  <- event sent WITHOUT dm_pflip_high_irq
> >   9931771 drm_vblank_put
> >   9931771 drm_atomic_helper_commit_hw_done
> >   9931771 drm_atomic_helper_wait_for_flip_done ENTER [tid=35929]
> >   9931771 drm_atomic_helper_wait_for_flip_done EXIT 0ms [tid=35929]  <- instant, already done
> >   9931773 drm_atomic_helper_commit_hw_done
> >   9931773 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- new commit
> >   9931777 dm_pflip_high_irq                           <- pflip fires, completes the wrong one
> >   9931777 drm_crtc_send_vblank_event
> >   9931777 drm_vblank_put
> >   9931777 drm_atomic_helper_wait_for_flip_done EXIT 3ms [tid=36929]
> >   9931781 drm_atomic_helper_commit_hw_done
> >   9931781 drm_atomic_helper_wait_for_flip_done ENTER [tid=36929]     <- THIS ONE HANGS
> >   ... 10328ms of silence ...
> >   9942110 drm_atomic_helper_wait_for_flip_done TIMEOUT waited 10328ms [tid=36929]
> >
> > The drm_crtc_send_vblank_event at 9931771 fires without
> > dm_pflip_high_irq. This is amdgpu_dm_crtc_handle_vblank() sending a
> > cursor-only event. The problem is that the cursor-only commit path in
> > amdgpu_dm_commit_planes() stores the event in acrtc->event and defers
> > delivery to the vblank handler. This creates two race conditions:
> >
> > - The vblank handler checks (pflip_status != SUBMITTED) which also
> >   matches NONE, so it can consume events meant for plane flips. The
> >   subsequent dm_pflip_high_irq finds no event, and the next commit hangs.
> >
> > - If vblank is disabled by the off-delay timer before the handler
> >   runs, the PENDING cursor event is never delivered and the commit hangs.
> >
> > The fix is to send cursor-only events immediately via
> > drm_crtc_send_vblank_event() in amdgpu_dm_commit_planes() instead of
> > deferring to the vblank handler. The cursor update is already committed
> > to hardware at this point, so immediate delivery is correct.
> > This eliminates both race conditions by removing cursor events from the
> > deferred delivery path entirely:
> >
> > - Plane flips: SUBMITTED -> dm_pflip_high_irq delivers (unchanged)
> > - Cursor updates: sent immediately in commit_planes (no deferral, no races)
> >
> > From git history, the check in amdgpu_dm_crtc_handle_vblank() has been
> > like this since 473683a03495 ("drm/amd/display: Create a file dedicated
> > for CRTC", 2022), which moved this code from amdgpu_dm.c, but it was
> > practically impossible to trigger because the default
> > drm_vblank_offdelay was 5000ms.
> > Commit 58a261bfc967 ("drm/amd/display: use a more lax vblank enable
> > policy for older ASICs") in 6.12 changed all ASICs to use
> > drm_crtc_vblank_on_config() with a computed off-delay of roughly
> > 2 frames (~14ms at 144Hz).
> > This made drm_vblank_disable_and_save fire hundreds of times more often,
> > turning a theoretical race into reality. The bpftrace log is full of
> > drm_vblank_disable_and_save events interleaved with the commit sequence.
> >
> > This fix was tested on DCN 2.1 (4700U), DCN 3.2 (7600M XT), and DCN 3.5
> > (9070 XT).
> > Under a high-frequency glxgears + cursor-jiggling test, the patch
> > successfully intercepted the race thousands of times without a single
> > timeout.
> > I am also running this on my main system without issues.
> >
> > My earlier, rushed attempt at addressing this,
> > https://lists.freedesktop.org/archives/amd-gfx/2026-February/138636.html
> > is no longer needed.
> >
> > Patch applies cleanly on top of tag v6.19.
>
> Really nice debugging work, thanks for catching this!
>
> Ideally, the cursor event should be delivered when hardware latches onto
> the new cursor info and starts scanning it out. The latching event fires
> an interrupt that should be handled by dm_crtc_high_irq().
>
> dm_pflip_high_irq() handles an interrupt specifically for when hardware
> latches onto a new fb address; I don't think it actually fires when
> there's a cursor-only update. I think if we really want to do it right,
> we can have another "acrtc_attach->cursor_event" just for cursor-only
> updates, and deliver the event in crtc_high_irq().
>
> In any case, I don't foresee any major issues with delivering the event early.
> And since it fixes an ongoing issue:
>
> Reviewed-by: Leo Li <[email protected]>

Leo, I assume you are planning to push this?

Alex

>
> Thanks!
> Leo
>
> >
> >  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index a8a59126b2d2..35987ce80c71 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -10168,8 +10168,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
> >       } else if (cursor_update && acrtc_state->active_planes > 0) {
> >               spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
> >               if (acrtc_attach->base.state->event) {
> > -                     drm_crtc_vblank_get(pcrtc);
> > -                     acrtc_attach->event = acrtc_attach->base.state->event;
> > +                     drm_crtc_send_vblank_event(pcrtc, acrtc_attach->base.state->event);
> >                       acrtc_attach->base.state->event = NULL;
> >               }
> >               spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
>
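
For readers following the thread, the effect of the one-line change in the
diff can be sketched with a small user-space model. This is only an
illustration under stated assumptions, not driver code: struct crtc_model,
vblank_handler(), cursor_commit_deferred(), cursor_commit_immediate() and
event_pending() are hypothetical names, and the model keeps only the
(pflip_status != SUBMITTED) check described in the patch notes, leaving out
locking, vblank reference counting and the off-delay timer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the driver's flip states; not the real amdgpu enum. */
enum flip_status { FLIP_NONE, FLIP_SUBMITTED };

/* Hypothetical, simplified view of the per-CRTC state. */
struct crtc_model {
	enum flip_status pflip_status;
	const char *event;     /* parked completion event, NULL if none */
	const char *last_sent; /* most recent event delivered to userspace */
};

void send_event(struct crtc_model *c, const char *ev)
{
	c->last_sent = ev;
}

/*
 * Models the check quoted in the thread: any status other than SUBMITTED,
 * including NONE, lets the vblank handler deliver whatever is parked.
 */
bool vblank_handler(struct crtc_model *c)
{
	if (c->pflip_status != FLIP_SUBMITTED && c->event) {
		send_event(c, c->event);
		c->event = NULL;
		return true;
	}
	return false;
}

/* Old behaviour: park the cursor event and rely on a later vblank. */
void cursor_commit_deferred(struct crtc_model *c, const char *ev)
{
	c->event = ev;
}

/* Patched behaviour: deliver at commit time, so nothing is parked. */
void cursor_commit_immediate(struct crtc_model *c, const char *ev)
{
	send_event(c, ev);
}

/* True while an event still depends on an interrupt handler running. */
bool event_pending(const struct crtc_model *c)
{
	return c->event != NULL;
}
```

In this model a deferred cursor commit leaves event_pending() true until
vblank_handler() runs, which is the window in which a disabled vblank or a
competing plane flip can interfere. With the immediate variant the event
slot is never occupied by a cursor update, so the vblank handler has
nothing to deliver for it.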
