https://bugs.kde.org/show_bug.cgi?id=521727

--- Comment #7 from sunghwan <[email protected]> ---
I believe I’ve identified the root cause of the issue and a way to continue
using the Vulkan backend on NVIDIA GPUs. I’m sharing my analysis and a
proof-of-concept patch in the hope that it might be of some assistance. This
was actually my first time analyzing the KWin codebase in depth, so I relied on
LLM assistance to understand the structure of KWin and draft the initial patch.
I would appreciate it if you could point out any mistakes or areas that might
need correction.

In KWin’s architecture, the copy operation is executed on the destination GPU.
The output GPU (NVIDIA, represented by m_copyDevice) imports the source buffer
(allocated on Intel) and runs Vulkan commands on its own queue to copy the
frame to its presentation swapchain. Because the Vulkan copy commands are
executed on the NVIDIA GPU, the NVIDIA driver is the one performing the memory
barriers and texture transfers, which is why the driver-level page faults and
GPU freezes occurred on the NVIDIA GPU.

Here are the details for a more in-depth analysis:

• Queue Family Transitions & PTE Write Page Faults: KWin transitions queue
family ownership to vk::QueueFamilyExternal to release image control back to
the display controller (required on Intel/AMD to handle color
decompression/DCC). However, on NVIDIA dGPUs, releasing imported external
memory to vk::QueueFamilyExternal causes the driver's page table entries
(PTEs) to be invalidated or unmapped. When the GPU's Copy Engine then tries to
write to the destination buffer, it hits virtual write page faults.

• Missing Source Image Layout Transitions & PDE Faults: Newly imported DMA-BUF
textures start in the undefined layout in Vulkan. Accessing them under the
assumption that they are in the general layout without an initial layout
transition leads to page directory translation faults.

• Graphics Sampler & PTE Read Page Faults: Blitting images relies on the
graphics pipeline and texture units. The hardware texture sampler unit has
strict address alignment limits for modifier-tiled external memory. When trying
to sample or read from the tiled DMA-BUF texture, it hits virtual read page
faults.
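
To make the first two points concrete, here is a minimal sketch of how the
barrier parameters could be chosen per vendor. It is only an illustration of
the idea and not the code from the patch: BarrierPlan, initialLayout and
planRelease are hypothetical names, and skipping the ownership release on
NVIDIA is an assumption about one possible workaround. The constants mirror
the values in the Vulkan headers so that the sketch builds without the SDK.

```cpp
#include <cstdint>

// Mirrors of Vulkan header values, so the sketch compiles standalone.
constexpr std::uint32_t kQueueFamilyIgnored  = ~0u;    // VK_QUEUE_FAMILY_IGNORED
constexpr std::uint32_t kQueueFamilyExternal = ~1u;    // VK_QUEUE_FAMILY_EXTERNAL
constexpr std::uint32_t kNvidiaVendorId      = 0x10DE; // PCI vendor ID of NVIDIA

enum class Layout { Undefined, General };

// Hypothetical summary of the fields an image memory barrier would carry.
struct BarrierPlan {
    Layout oldLayout;
    Layout newLayout;
    std::uint32_t srcQueueFamily;
    std::uint32_t dstQueueFamily;
};

// Hypothetical helper: a freshly imported DMA-BUF image is still in the
// undefined layout, so the first barrier has to transition from there
// instead of assuming the general layout.
Layout initialLayout(bool firstUse)
{
    return firstUse ? Layout::Undefined : Layout::General;
}

// Hypothetical helper: barrier recorded after the copy has finished.
// Assumption: on NVIDIA the release to the external queue family is skipped,
// every other vendor keeps the release that KWin performs today.
BarrierPlan planRelease(std::uint32_t vendorId, std::uint32_t copyQueueFamily)
{
    if (vendorId == kNvidiaVendorId) {
        return {Layout::General, Layout::General,
                kQueueFamilyIgnored, kQueueFamilyIgnored};
    }
    return {Layout::General, Layout::General,
            copyQueueFamily, kQueueFamilyExternal};
}
```

In real code vendorId would come from VkPhysicalDeviceProperties::vendorID of
the copy device, and the plan would be translated into a VkImageMemoryBarrier.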

Addressing the three issues above significantly reduced the system freezes.
However, the symptoms still recur after an extended period of time, so the fix
for the Inter-API Fence Deadlocks below is also necessary to ensure long-term
stability.

• Inter-API Fence Deadlocks: In hybrid setups, the frame is rendered on OpenGL
(compositor) and copied via Vulkan. Syncing OpenGL completion to Vulkan
requires exporting an OpenGL native fence file descriptor (sync FD) and
importing it as a Vulkan binary semaphore. Under load or after running for a
while, these fences fail to resolve in the NVIDIA driver, causing queue
deadlocks and Vulkan device lost timeouts.

Here is the link to the proof-of-concept patch: https://pastebin.com/0pnLGQrX
Please note that it is not yet fully polished, but applying it has resolved the
NVIDIA GPU crashes in Vulkan. I sincerely hope this proves helpful for your
analysis.
