Hi, On 6/18/26 10:18, Ryosuke Yasuoka wrote: > virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() wait for > virtqueue space using wait_event() with vqs_released as the only abort > condition. This covers the device removal path, where > virtio_gpu_release_vqs() sets the flag, but does not help when the host > simply stops processing the virtqueue while the device remains present. > > In that case, the virtqueue fills up and subsequent command submissions > block indefinitely in D state with no diagnostic output, making the root > cause difficult to identify. > > Replace the bare wait_event() with a wait_event_timeout() loop that > prints a warning every 5 seconds while the virtqueue remains full. The > wait still blocks indefinitely so driver behavior is unchanged. The > warnings help identify an unresponsive host device during > troubleshooting. > > Signed-off-by: Ryosuke Yasuoka <[email protected]> > --- > When the host stops processing the virtio-gpu virtqueue without > triggering device removal, the bare wait_event() in > virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() blocks > indefinitely with no diagnostic output. A DRM atomic commit worker > blocks in virtio_gpu_queue_fenced_ctrl_buffer() while holding the > modeset lock. During graceful shutdown, systemd (PID 1) needs the same > lock — either by writing to the console via fbcon, or by closing a DRM > file descriptor that triggers framebuffer cleanup — and blocks as well, > making the VM unrecoverable without a forced power-off. > > PID: 553 COMMAND: "kworker/u4:3" > #0 __schedule > #1 schedule > #2 virtio_gpu_queue_fenced_ctrl_buffer [virtio_gpu] > #3 virtio_gpu_primary_plane_update [virtio_gpu] > ... > > PID: 1 COMMAND: "systemd" (console write path) > #0 __schedule > #1 schedule > #2 schedule_preempt_disabled > #3 __ww_mutex_lock > #4 drm_modeset_lock [drm] > #5 drm_atomic_get_plane_state [drm] > #6 drm_client_modeset_commit_atomic [drm] > #7 drm_client_modeset_commit_locked [drm] > #8 drm_fb_helper_pan_display [drm_kms_helper] > #9 fb_pan_display > #10 bit_update_start > #11 fbcon_switch > #12 redraw_screen > ... > > Reproduction steps: > 1. Build QEMU with the fault injection patch [1] that adds an > x-ctrl-queue-broken property to virtio-gpu. > 2. Boot the VM and trigger the fault injection from the host. > 3. Fill the ctrlq (e.g., move the mouse on the guest's display). > The process gets stuck in virtio_gpu_queue_fenced_ctrl_buffer() > in D state. > 4. Run a graceful shutdown command (shutdown now or reboot). > 5. The shutdown process hangs. > > My earlier patch a46991b334f6 ("drm/virtio: abort virtqueue wait on > device removal to avoid hung task") covers the case where the shutdown > process reaches the device_shutdown() call path, which sets vqs_released > to unblock the wait. However, during graceful shutdown, systemd (PID 1) > gets stuck on the modeset lock before ever reaching device_shutdown(), > so vqs_released is never set and the wait is never unblocked. > > I initially considered adding a module parameter to abort the wait with > -ENODEV on timeout: > > +static unsigned int virtio_gpu_vq_timeout; > +MODULE_PARM_DESC(vq_timeout, > + "Timeout in seconds for virtqueue wait (0 = no timeout, default)"); > +module_param_named(vq_timeout, virtio_gpu_vq_timeout, uint, 0444); > ... > + if (virtio_gpu_vq_timeout) { > + if (!wait_event_timeout(vgdev->ctrlq.ack_queue, > + vq->num_free >= elemcnt || > + vgdev->vqs_released, > + > secs_to_jiffies(virtio_gpu_vq_timeout))) { > + if (fence && vbuf->objs) > + > virtio_gpu_array_unlock_resv(vbuf->objs); > + free_vbuf(vgdev, vbuf); > + drm_dev_exit(idx); > + return -ENODEV; > + } > + } else { > + wait_event(vgdev->ctrlq.ack_queue, > + vq->num_free >= elemcnt || > + vgdev->vqs_released); > + } > > This approach aborts the wait and allows the graceful shutdown process > to eventually proceed, albeit with a delay. > > But that approach has drawbacks: it allows users to set arbitrarily > short timeouts that could destabilize the driver, and aborting commands > mid-flight is a rough recovery path. An unconditional timeout was also > discussed previously [2] but is not appropriate without virtio > specification support. > > This patch takes a safer approach: replace the bare wait_event() with > wait_event_timeout() and print a warning every 5 seconds while the > virtqueue remains full. The wait still blocks indefinitely and no > commands are aborted, so driver behavior is unchanged. The warnings > help identify an unresponsive host device during troubleshooting. > Once the user notices the warning, they can work around the hang by > unbinding the VT from fbcon, removing the device, or forcing a shutdown > via SysRq. > > [1] https://gist.github.com/YsuOS/fbcd181752594af35f954953a1d260b8 > [2] > https://lore.kernel.org/all/[email protected]/ > --- > drivers/gpu/drm/virtio/virtgpu_vq.c | 18 ++++++++++++++---- > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c > b/drivers/gpu/drm/virtio/virtgpu_vq.c > index 68d097ad9d1d..a546130d3b6a 100644 > --- a/drivers/gpu/drm/virtio/virtgpu_vq.c > +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c > @@ -410,8 +410,13 @@ static int virtio_gpu_queue_ctrl_sgs(struct > virtio_gpu_device *vgdev, > if (vq->num_free < elemcnt) { > spin_unlock(&vgdev->ctrlq.qlock); > virtio_gpu_notify(vgdev); > - wait_event(vgdev->ctrlq.ack_queue, > - vq->num_free >= elemcnt || vgdev->vqs_released); > + while (!wait_event_timeout(vgdev->ctrlq.ack_queue, > + vq->num_free >= elemcnt || > + vgdev->vqs_released, > + 5 * HZ) && !vgdev->vqs_released) > + DRM_WARN("ctrlq waiting for host: no free space for %d > secs\n", > + 5); > + > /* > * Set by virtio_gpu_release_vqs() to unblock > * synchronize_srcu() wait in drm_dev_unplug(). > @@ -592,8 +597,13 @@ static void virtio_gpu_queue_cursor(struct > virtio_gpu_device *vgdev, > ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC); > if (ret == -ENOSPC) { > spin_unlock(&vgdev->cursorq.qlock); > - wait_event(vgdev->cursorq.ack_queue, > - vq->num_free >= outcnt || vgdev->vqs_released); > + while (!wait_event_timeout(vgdev->cursorq.ack_queue, > + vq->num_free >= outcnt || > + vgdev->vqs_released, > + 5 * HZ) && !vgdev->vqs_released) > + DRM_WARN("cursorq waiting for host: no free space for > %d secs\n", > + 5); > + > /* See comment in virtio_gpu_queue_ctrl_sgs(). */ > if (vgdev->vqs_released) { > free_vbuf(vgdev, vbuf);
The [1] has a valid point about the hangcheck warning. Is wait_event_timeout() worth the change then? [1] https://sashiko.dev/#/patchset/20260618-virtiogpu_add_timeout-v1-1-dc36cef609d9%40redhat.com -- Best regards, Dmitry

