On 2026-02-27 03:58, Michele Palazzi wrote:
> On 2/27/26 09:53, Michele Palazzi wrote:
>> On 2/23/26 16:27, Leo Li wrote:
>>>
>>> Really nice debuging work, thanks for catching this!
>>>
>>> Ideally, the cursor event should be delivered when hardware latches onto
>>> the new
>>> cursor info and starts scanning it out. The latching event fires an
>>> interrupt
>>> that should be handled by dm_crtc_high_irq().
>>>
>>> dm_pflip_high_irq() handles an interrupt specifically for when hardware
>>> latches
>>> onto a new fb address; I don't think it actually fires when there's a
>>> cursor-only update. I think if we really want to do it right, we can have
>>> another "acrtc_attach->cursor_event" just for cusror-only updates, and
>>> deliver
>>> the event in crtc_high_irq().
>>>
>>> In any case, I don't foresee any major issues with delivering the event
>>> early.
>>> And since it fixes an ongoing issue:
>>>
>>> Reviewed-by: Leo Li <[email protected]>
>>>
>>> Thanks!
>>> Leo
>>
>> Thanks for the review. Further testing confirms that both this patch and
>> increasing the dGPU vblank offdelay (from 2 frames to ~50 frames)
>> independently eliminate the flip timeouts in my testing. Both work by
>> reducing the frequency of vblank disable/re-enable cycles, basically either
>> could be an interim fix.
>>
>> Your deferred vblank enable/disable series https://lore.kernel.org/amd-
>> gfx/[email protected]/T/#t looks like it could be
>> the proper solution going forward instead (haven't tested it).
>>
Looking at this a bit more, I'm not sure if we're understanding the trace
correctly.
Let's first assume the cursor update is not an legacy_cursor_update: In both
non-blocking and blocking atomic commits, there should be mechanisms in place
that limits the number of in-flight atomic_commit_tail()s per crtc to 1 (see
drm_atomic_helper_wait_for_dependencies()). IOW, After each independent cursor
**or** fb update, there should be one flip_done completion from
drm_crtc_send_vblank_event(), before the next update is allowed to continue.
Since the event is "armed" as part of atomic_commit_tail(), and "completed" in
either pflip_high_irq or crtc_high_irq, racing "arms" of acrtc->event should not
be possible.
A combined cursor **and** flip update should use a single event and flip_done
completion, since it's one atomic_commit_tail to update both.
Now if it is a legacy_cursor_update, DRM core first checks if the driver can
commit it asynchronously, and set state->async_update=true if it can. If
async_update==true, drm_atomic_helper_commit() skips setting up the event
entirely. Otherwise, drm_atomic_helper_setup_commit() will check if
legacy_cursor_update==true. If it is, it completes flip_done early *and* skips
setting up the event. So either way, there's no event to send, nor flip_done to
wait on.
But evidently in the trace, something awry is going on. Though I'm not sure if
it's because of the race condition as described. It would be interesting to
trace the events at the point that they're created, armed, then completed, and
see if there's some mismatch going on.
Here's a patch that inserts a few trace events.
https://pastebin.com/dpLnVSbu
Could you try to reproduce the hang again while recording these trace events?
Using trace-cmd (with stack trace enabled '-T'):
trace-cmd record -e amdgpu_dm_event_arm -e drm_vblank_dbg* -T
trace-cmd report trace.dat
The timeout can be found by searching 'remaining_wait_ms=0'.
Regarding the deferred vblank patchset, if the issue is indeed racing writes of
amdgpu_crtc->event, then I don't imagine that patchset would help. It's
intended to solve a different race.
Thanks,
Leo
>
> fixed [email protected] cc