Hi,

On 6/18/26 10:18, Ryosuke Yasuoka wrote:
> virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() wait for
> virtqueue space using wait_event() with vqs_released as the only abort
> condition. This covers the device removal path, where
> virtio_gpu_release_vqs() sets the flag, but does not help when the host
> simply stops processing the virtqueue while the device remains present.
> 
> In that case, the virtqueue fills up and subsequent command submissions
> block indefinitely in D state with no diagnostic output, making the root
> cause difficult to identify.
> 
> Replace the bare wait_event() with a wait_event_timeout() loop that
> prints a warning every 5 seconds while the virtqueue remains full. The
> wait still blocks indefinitely so driver behavior is unchanged. The
> warnings help identify an unresponsive host device during
> troubleshooting.
> 
> Signed-off-by: Ryosuke Yasuoka <[email protected]>
> ---
> When the host stops processing the virtio-gpu virtqueue without
> triggering device removal, the bare wait_event() in
> virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() blocks
> indefinitely with no diagnostic output. A DRM atomic commit worker
> blocks in virtio_gpu_queue_fenced_ctrl_buffer() while holding the
> modeset lock. During graceful shutdown, systemd (PID 1) needs the same
> lock — either by writing to the console via fbcon, or by closing a DRM
> file descriptor that triggers framebuffer cleanup — and blocks as well,
> making the VM unrecoverable without a forced power-off.
> 
>   PID: 553    COMMAND: "kworker/u4:3"
>    #0 __schedule
>    #1 schedule
>    #2 virtio_gpu_queue_fenced_ctrl_buffer [virtio_gpu]
>    #3 virtio_gpu_primary_plane_update [virtio_gpu]
>    ...
> 
>   PID: 1      COMMAND: "systemd"  (console write path)
>    #0 __schedule
>    #1 schedule
>    #2 schedule_preempt_disabled
>    #3 __ww_mutex_lock
>    #4 drm_modeset_lock [drm]
>    #5 drm_atomic_get_plane_state [drm]
>    #6 drm_client_modeset_commit_atomic [drm]
>    #7 drm_client_modeset_commit_locked [drm]
>    #8 drm_fb_helper_pan_display [drm_kms_helper]
>    #9 fb_pan_display
>   #10 bit_update_start
>   #11 fbcon_switch
>   #12 redraw_screen
>    ...
> 
> Reproduction steps:
> 1. Build QEMU with the fault injection patch [1] that adds an
>    x-ctrl-queue-broken property to virtio-gpu.
> 2. Boot the VM and trigger the fault injection from the host.
> 3. Fill the ctrlq (e.g., move the mouse on the guest's display).
>    The process gets stuck in virtio_gpu_queue_fenced_ctrl_buffer()
>    in D state.
> 4. Run a graceful shutdown command (shutdown now or reboot).
> 5. The shutdown process hangs.
> 
> My earlier patch a46991b334f6 ("drm/virtio: abort virtqueue wait on
> device removal to avoid hung task") covers the case where the shutdown
> process reaches the device_shutdown() call path, which sets vqs_released
> to unblock the wait. However, during graceful shutdown, systemd (PID 1)
> gets stuck on the modeset lock before ever reaching device_shutdown(),
> so vqs_released is never set and the wait is never unblocked.
> 
> I initially considered adding a module parameter to abort the wait with
> -ENODEV on timeout:
> 
>   +static unsigned int virtio_gpu_vq_timeout;
>   +MODULE_PARM_DESC(vq_timeout,
>   +     "Timeout in seconds for virtqueue wait (0 = no timeout, default)");
>   +module_param_named(vq_timeout, virtio_gpu_vq_timeout, uint, 0444);
>   ...
>   +             if (virtio_gpu_vq_timeout) {
>   +                     if (!wait_event_timeout(vgdev->ctrlq.ack_queue,
>   +                                             vq->num_free >= elemcnt ||
>   +                                             vgdev->vqs_released,
>   +                                             
> secs_to_jiffies(virtio_gpu_vq_timeout))) {
>   +                             if (fence && vbuf->objs)
>   +                                     
> virtio_gpu_array_unlock_resv(vbuf->objs);
>   +                             free_vbuf(vgdev, vbuf);
>   +                             drm_dev_exit(idx);
>   +                             return -ENODEV;
>   +                     }
>   +             } else {
>   +                     wait_event(vgdev->ctrlq.ack_queue,
>   +                                vq->num_free >= elemcnt ||
>   +                                vgdev->vqs_released);
>   +             }
> 
> This approach aborts the wait and allows the graceful shutdown process
> to eventually proceed, albeit with a delay.
> 
> But that approach has drawbacks: it allows users to set arbitrarily
> short timeouts that could destabilize the driver, and aborting commands
> mid-flight is a rough recovery path. An unconditional timeout was also
> discussed previously [2] but is not appropriate without virtio
> specification support.
> 
> This patch takes a safer approach: replace the bare wait_event() with
> wait_event_timeout() and print a warning every 5 seconds while the
> virtqueue remains full. The wait still blocks indefinitely and no
> commands are aborted, so driver behavior is unchanged. The warnings
> help identify an unresponsive host device during troubleshooting.
> Once the user notices the warning, they can work around the hang by
> unbinding the VT from fbcon, removing the device, or forcing a shutdown
> via SysRq.
> 
> [1] https://gist.github.com/YsuOS/fbcd181752594af35f954953a1d260b8
> [2] 
> https://lore.kernel.org/all/[email protected]/
> ---
>  drivers/gpu/drm/virtio/virtgpu_vq.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c 
> b/drivers/gpu/drm/virtio/virtgpu_vq.c
> index 68d097ad9d1d..a546130d3b6a 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
> @@ -410,8 +410,13 @@ static int virtio_gpu_queue_ctrl_sgs(struct 
> virtio_gpu_device *vgdev,
>       if (vq->num_free < elemcnt) {
>               spin_unlock(&vgdev->ctrlq.qlock);
>               virtio_gpu_notify(vgdev);
> -             wait_event(vgdev->ctrlq.ack_queue,
> -                        vq->num_free >= elemcnt || vgdev->vqs_released);
> +             while (!wait_event_timeout(vgdev->ctrlq.ack_queue,
> +                                        vq->num_free >= elemcnt ||
> +                                        vgdev->vqs_released,
> +                                        5 * HZ) && !vgdev->vqs_released)
> +                     DRM_WARN("ctrlq waiting for host: no free space for %d 
> secs\n",
> +                              5);
> +
>               /*
>                * Set by virtio_gpu_release_vqs() to unblock
>                * synchronize_srcu() wait in drm_dev_unplug().
> @@ -592,8 +597,13 @@ static void virtio_gpu_queue_cursor(struct 
> virtio_gpu_device *vgdev,
>       ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
>       if (ret == -ENOSPC) {
>               spin_unlock(&vgdev->cursorq.qlock);
> -             wait_event(vgdev->cursorq.ack_queue,
> -                        vq->num_free >= outcnt || vgdev->vqs_released);
> +             while (!wait_event_timeout(vgdev->cursorq.ack_queue,
> +                                        vq->num_free >= outcnt ||
> +                                        vgdev->vqs_released,
> +                                        5 * HZ) && !vgdev->vqs_released)
> +                     DRM_WARN("cursorq waiting for host: no free space for 
> %d secs\n",
> +                              5);
> +
>               /* See comment in virtio_gpu_queue_ctrl_sgs(). */
>               if (vgdev->vqs_released) {
>                       free_vbuf(vgdev, vbuf);

The [1] has a valid point about the hangcheck warning. Is
wait_event_timeout() worth the change then?

[1]
https://sashiko.dev/#/patchset/20260618-virtiogpu_add_timeout-v1-1-dc36cef609d9%40redhat.com

-- 
Best regards,
Dmitry

Reply via email to