virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() wait for
virtqueue space using wait_event() with vqs_released as the only abort
condition. This covers the device removal path, where
virtio_gpu_release_vqs() sets the flag, but does not help when the host
simply stops processing the virtqueue while the device remains present.

In that case, the virtqueue fills up and subsequent command submissions
block indefinitely in D state with no diagnostic output, making the root
cause difficult to identify.

Replace the bare wait_event() with a wait_event_timeout() loop that
prints a warning every 5 seconds while the virtqueue remains full. The
wait still blocks indefinitely so driver behavior is unchanged. The
warnings help identify an unresponsive host device during
troubleshooting.

Signed-off-by: Ryosuke Yasuoka <[email protected]>
---
When the host stops processing the virtio-gpu virtqueue without
triggering device removal, the bare wait_event() in
virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() blocks
indefinitely with no diagnostic output. A DRM atomic commit worker
blocks in virtio_gpu_queue_fenced_ctrl_buffer() while holding the
modeset lock. During graceful shutdown, systemd (PID 1) needs the same
lock — either by writing to the console via fbcon, or by closing a DRM
file descriptor that triggers framebuffer cleanup — and blocks as well,
making the VM unrecoverable without a forced power-off.

  PID: 553    COMMAND: "kworker/u4:3"
   #0 __schedule
   #1 schedule
   #2 virtio_gpu_queue_fenced_ctrl_buffer [virtio_gpu]
   #3 virtio_gpu_primary_plane_update [virtio_gpu]
   ...

  PID: 1      COMMAND: "systemd"  (console write path)
   #0 __schedule
   #1 schedule
   #2 schedule_preempt_disabled
   #3 __ww_mutex_lock
   #4 drm_modeset_lock [drm]
   #5 drm_atomic_get_plane_state [drm]
   #6 drm_client_modeset_commit_atomic [drm]
   #7 drm_client_modeset_commit_locked [drm]
   #8 drm_fb_helper_pan_display [drm_kms_helper]
   #9 fb_pan_display
  #10 bit_update_start
  #11 fbcon_switch
  #12 redraw_screen
   ...

Reproduction steps:
1. Build QEMU with the fault injection patch [1] that adds an
   x-ctrl-queue-broken property to virtio-gpu.
2. Boot the VM and trigger the fault injection from the host.
3. Fill the ctrlq (e.g., move the mouse on the guest's display).
   The submitting task gets stuck in
   virtio_gpu_queue_fenced_ctrl_buffer() in D state.
4. Run a graceful shutdown command (shutdown now or reboot).
5. The shutdown process hangs.

My earlier patch a46991b334f6 ("drm/virtio: abort virtqueue wait on
device removal to avoid hung task") covers the case where the shutdown
process reaches the device_shutdown() call path, which sets vqs_released
to unblock the wait. However, during graceful shutdown, systemd (PID 1)
gets stuck on the modeset lock before ever reaching device_shutdown(),
so vqs_released is never set and the wait is never unblocked.

I initially considered adding a module parameter to abort the wait with
-ENODEV on timeout:

  +static unsigned int virtio_gpu_vq_timeout;
  +MODULE_PARM_DESC(vq_timeout,
  +     "Timeout in seconds for virtqueue wait (0 = no timeout, default)");
  +module_param_named(vq_timeout, virtio_gpu_vq_timeout, uint, 0444);
  ...
  +             if (virtio_gpu_vq_timeout) {
  +                     if (!wait_event_timeout(vgdev->ctrlq.ack_queue,
  +                                             vq->num_free >= elemcnt ||
  +                                             vgdev->vqs_released,
  +                                             secs_to_jiffies(virtio_gpu_vq_timeout))) {
  +                             if (fence && vbuf->objs)
  +                                     virtio_gpu_array_unlock_resv(vbuf->objs);
  +                             free_vbuf(vgdev, vbuf);
  +                             drm_dev_exit(idx);
  +                             return -ENODEV;
  +                     }
  +             } else {
  +                     wait_event(vgdev->ctrlq.ack_queue,
  +                                vq->num_free >= elemcnt ||
  +                                vgdev->vqs_released);
  +             }

That approach aborts the wait, so a graceful shutdown eventually
proceeds, albeit with a delay. However, it has drawbacks: users could
set arbitrarily short timeouts that destabilize the driver, and aborting
commands mid-flight is a rough recovery path. An unconditional timeout
was also discussed previously [2] but is not appropriate without virtio
specification support.

This patch takes a safer approach: replace the bare wait_event() with
wait_event_timeout() and print a warning every 5 seconds while the
virtqueue remains full. The wait still blocks indefinitely and no
commands are aborted, so driver behavior is unchanged. The warnings
help identify an unresponsive host device during troubleshooting.
Once the user notices the warning, they can work around the hang by
unbinding the VT from fbcon, removing the device, or forcing a shutdown
via SysRq.
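
For anyone who wants to check the loop's behavior outside the kernel,
here is a minimal userspace sketch; it is not driver code. struct
model_vq, host_step_fn, and the model_* and host_* functions are
hypothetical names made up for this illustration, and the simulated host
is stepped once per warning period instead of sleeping for 5 seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the state the driver checks. */
struct model_vq {
	unsigned int num_free;	/* free descriptors, like vq->num_free */
	bool released;		/* like vgdev->vqs_released */
};

/* Simulated host: called once per warning period, may update the queue. */
typedef void (*host_step_fn)(struct model_vq *vq, unsigned int period);

/*
 * Models one wait_event_timeout() call: nonzero if the condition became
 * true within the period, 0 if the period elapsed with it still false.
 */
static long model_wait_event_timeout(struct model_vq *vq, unsigned int need,
				     host_step_fn host_step,
				     unsigned int period)
{
	host_step(vq, period);
	return (vq->num_free >= need || vq->released) ? 1 : 0;
}

/* Same loop shape as the patch; returns how many warnings were printed. */
static unsigned int model_wait_for_space(struct model_vq *vq,
					 unsigned int need,
					 host_step_fn host_step)
{
	unsigned int warnings = 0;
	unsigned int period = 0;

	assert(vq && host_step);

	while (!model_wait_event_timeout(vq, need, host_step, period++) &&
	       !vq->released) {
		fprintf(stderr,
			"ctrlq waiting for host: no free space for %d secs\n",
			5);
		warnings++;
	}
	return warnings;
}

/* Example host that stalls for three periods, then frees descriptors. */
static void host_resumes_after_three(struct model_vq *vq, unsigned int period)
{
	if (period >= 3)
		vq->num_free = 8;
}

/*
 * Example host that never processes the queue; the device is removed
 * during the third period (index 2), which sets the release flag.
 */
static void host_dead_then_removed(struct model_vq *vq, unsigned int period)
{
	if (period >= 2)
		vq->released = true;
}
```

In this model the loop only ends when space becomes available or the
release flag is set; a timeout by itself never ends the wait, it only
produces a warning.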

[1] https://gist.github.com/YsuOS/fbcd181752594af35f954953a1d260b8
[2] https://lore.kernel.org/all/[email protected]/
---
 drivers/gpu/drm/virtio/virtgpu_vq.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
index 68d097ad9d1d..a546130d3b6a 100644
--- a/drivers/gpu/drm/virtio/virtgpu_vq.c
+++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
@@ -410,8 +410,13 @@ static int virtio_gpu_queue_ctrl_sgs(struct virtio_gpu_device *vgdev,
        if (vq->num_free < elemcnt) {
                spin_unlock(&vgdev->ctrlq.qlock);
                virtio_gpu_notify(vgdev);
-               wait_event(vgdev->ctrlq.ack_queue,
-                          vq->num_free >= elemcnt || vgdev->vqs_released);
+               while (!wait_event_timeout(vgdev->ctrlq.ack_queue,
+                                          vq->num_free >= elemcnt ||
+                                          vgdev->vqs_released,
+                                          5 * HZ) && !vgdev->vqs_released)
+                       DRM_WARN("ctrlq waiting for host: no free space for %d secs\n",
+                                5);
+
                /*
                 * Set by virtio_gpu_release_vqs() to unblock
                 * synchronize_srcu() wait in drm_dev_unplug().
@@ -592,8 +597,13 @@ static void virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
        ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
        if (ret == -ENOSPC) {
                spin_unlock(&vgdev->cursorq.qlock);
-               wait_event(vgdev->cursorq.ack_queue,
-                          vq->num_free >= outcnt || vgdev->vqs_released);
+               while (!wait_event_timeout(vgdev->cursorq.ack_queue,
+                                          vq->num_free >= outcnt ||
+                                          vgdev->vqs_released,
+                                          5 * HZ) && !vgdev->vqs_released)
+                       DRM_WARN("cursorq waiting for host: no free space for %d secs\n",
+                                5);
+
                /* See comment in virtio_gpu_queue_ctrl_sgs(). */
                if (vgdev->vqs_released) {
                        free_vbuf(vgdev, vbuf);

---
base-commit: b9e2d5cdaab05c997be3a69d9b372d7676683e1b
change-id: 20260616-virtiogpu_add_timeout-33e985718c22

Best regards,
-- 
Ryosuke Yasuoka <[email protected]>

