> Then second even if the kernel can do it I'm not sure if it should do it.
>
> I mean userspace asked for a quick page flip and not some expensive CRTC/PLL
> reprogramming. Stuff like that usually takes some time and by then the frame
> which should be displayed by the page flip might already be stale and it
> would be better to tell userspace that we couldn't display it on time and
> wait for a new frame to be generated.

I would personally prefer a new "pageflip failed" event, which the
compositor can react to appropriately. For compositors not opting into
that new API, the kernel automatically fixing things would be nice
though. Even pretending the pageflip completed and then failing the next
one with EINVAL would be enough to trigger a modeset in the case of KWin.

> And third, there must be a root cause of the page flip not completing.
>
> My educated guess is that we have some atomic property change or even turning
> the CRTC off in parallel with the page flip. I mean HW rarely turns off its
> reoccurring vblank interrupt on its own.
>
> Returning an error to userspace might actually help identify the root cause.

There are two things I know that trigger pageflip timeouts frequently.

On dedicated GPUs, most of them seem to happen when a game causes a GPU
reset. In some cases, it seemed like the pageflip did complete, but the
kernel never sent the pageflip event to userspace. It also rejected new
atomic commits of the compositor with EBUSY - but a new instance of the
compositor could still do atomic commits just fine. In other cases,
triggering another GPU reset was necessary to recover, and in yet other
cases it was just broken beyond repair. Presumably, all of them are
caused by bugs in the GPU reset sequence.

As another piece of information on that, KWin only does atomic commits
once the fences of the involved buffers are signaled, and it does not
use OUT_FENCE_FD. So fence signaling should not be relevant, at least
not on the KMS uAPI level.

On APUs, the vast majority are caused by PSR. I know many AMD laptop
users that run with "amdgpu.dcdebugmask=0x10" to disable PSR as a
workaround, and have never seen a pageflip timeout since setting that
option. I have even seen a freeze *without* a pageflip timeout in
testing PSR again on my laptop recently, so PSR seems to have even
bigger issues.
Pageflip timeout or not, manually triggering a GPU reset seems to be a
reliable way to recover from it. IMO that one is bad and widespread
enough that PSR should be disabled by default on the relevant AMD
hardware until it no longer causes such problems.

On the topic of whether or not this is just a thing the driver has to
fix, this isn't as exclusive to amdgpu as it might seem. i915 has some
pageflip timeout issues with apparently still unknown causes
(https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14604), and the
proprietary Nvidia driver had one some time ago that IIRC was caused by
firmware bugs.

Obviously, drivers still need to be fixed, but the bug is bad enough for
the end user that a fallback would be very helpful. If userspace gets
notified about it, we can still direct users to the relevant bug
trackers to get the underlying bugs hopefully fixed either way.

- Xaver
