https://bugs.kde.org/show_bug.cgi?id=521727
--- Comment #7 from sunghwan <[email protected]> ---
I believe I’ve identified the root cause of the issue and a way to continue using the Vulkan backend on NVIDIA GPUs. I’m sharing my analysis and a proof-of-concept patch in the hope that it might be of some assistance. This was my first time analyzing the KWin codebase in depth, so I relied on LLM assistance to understand the structure of KWin and draft the initial patch. I would appreciate it if you could point out any mistakes or areas that might need correction.

In KWin’s architecture, the copy operation is executed on the destination GPU. The output GPU (NVIDIA, represented by m_copyDevice) imports the source buffer (allocated on Intel) and runs Vulkan commands on its own queue to copy the frame to its presentation swapchain. Because the Vulkan copy commands are executed on the NVIDIA GPU, the NVIDIA driver is the one performing the memory barriers and texture transfers, which is why the driver-level page faults and GPU freezes occurred on the NVIDIA GPU.

Here are the details for a more in-depth analysis:

• Queue Family Transitions & PTE Write Page Faults: KWin transitions queue family ownership to vk::QueueFamilyExternal to release image control back to the display controller (required on Intel/AMD to handle color decompression/DCC). However, on NVIDIA dGPUs, releasing imported external memory to vk::QueueFamilyExternal causes the driver's page table entries (PTE) to be invalidated or unmapped. When the GPU's Copy Engine tries to write to the destination buffer, it hits virtual write page faults.

• Missing Source Image Layout Transitions & PDE Faults: Newly imported DMA-BUF textures start in the undefined layout in Vulkan. Accessing them under the assumption that they are in the general layout without an initial layout transition leads to page directory translation faults.

• Graphics Sampler & PTE Read Page Faults: Blitting images relies on the graphics pipeline and texture units. The hardware texture sampler unit has strict address alignment limits for modifier-tiled external memory. When trying to sample or read from the tiled DMA-BUF texture, it hits virtual read page faults.

Addressing the three issues above has significantly reduced the system freezes. However, the symptoms eventually recur after an extended period of time; therefore, applying the fix for the Inter-API Fence Deadlocks below is necessary to ensure long-term stability.

• Inter-API Fence Deadlocks: In hybrid setups, the frame is rendered on OpenGL (compositor) and copied via Vulkan. Syncing OpenGL completion to Vulkan requires exporting an OpenGL native fence file descriptor (sync FD) and importing it as a Vulkan binary semaphore. Under load or after running for a while, these fences fail to resolve in the NVIDIA driver, causing queue deadlocks and Vulkan device lost timeouts.

The proof-of-concept patch is here: https://pastebin.com/0pnLGQrX

Please note that it is not yet fully polished, but applying it has resolved the NVIDIA GPU crashes in Vulkan. I sincerely hope this proves helpful for your analysis.
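
For illustration only, here is a minimal sketch of how the first two points (the queue family release and the initial layout of an imported image) might translate into barrier parameters. It is not taken from the linked patch. The struct and function names (ImportedImage, BarrierLayouts, QueueFamilies, prepareSourceForCopy, releaseQueueFamilies) are hypothetical, and keying the decision on the PCI vendor ID is an assumption. The constants mirror the values in the Vulkan headers so the sketch builds without the SDK; real code would put these values into a vk::ImageMemoryBarrier.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins mirroring Vulkan header values, so no SDK is needed to build this.
constexpr uint32_t QueueFamilyIgnored  = ~0u;    // VK_QUEUE_FAMILY_IGNORED
constexpr uint32_t QueueFamilyExternal = ~1u;    // VK_QUEUE_FAMILY_EXTERNAL
constexpr uint32_t VendorNvidia        = 0x10DE; // PCI vendor ID of NVIDIA

enum class Layout : uint32_t {
    Undefined          = 0, // VK_IMAGE_LAYOUT_UNDEFINED
    General            = 1, // VK_IMAGE_LAYOUT_GENERAL
    TransferSrcOptimal = 6, // VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL
};

// Hypothetical bookkeeping: a freshly imported DMA-BUF image starts out Undefined.
struct ImportedImage {
    Layout currentLayout = Layout::Undefined;
};

// Hypothetical subset of an image memory barrier: the layout pair.
struct BarrierLayouts {
    Layout oldLayout;
    Layout newLayout;
};

// Hypothetical subset of an image memory barrier: the queue family pair.
struct QueueFamilies {
    uint32_t src;
    uint32_t dst;
};

// Use the tracked layout as oldLayout instead of assuming General,
// then remember the layout the image is left in.
BarrierLayouts prepareSourceForCopy(ImportedImage &image)
{
    BarrierLayouts barrier{image.currentLayout, Layout::TransferSrcOptimal};
    image.currentLayout = Layout::TransferSrcOptimal;
    return barrier;
}

// Queue families for the barrier recorded after the copy.
// Assumption: on NVIDIA the release to the external queue family is skipped
// (Ignored/Ignored means no ownership transfer); other vendors keep the release.
QueueFamilies releaseQueueFamilies(uint32_t vendorId, uint32_t copyQueueFamily)
{
    assert(copyQueueFamily != QueueFamilyIgnored);
    if (vendorId == VendorNvidia) {
        return {QueueFamilyIgnored, QueueFamilyIgnored};
    }
    return {copyQueueFamily, QueueFamilyExternal};
}
```

The sampler and fence points are left out, since they depend on how the copy and the OpenGL-to-Vulkan synchronization are actually implemented in the patch.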
