> From: "Michael S. Tsirkin" <[email protected]>
> Date: Mon, Mar 2, 2026, 08:45
> Subject: Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet
> exceeds virtqueue capacity
> To: "ShuangYu" <[email protected]>
> Cc: "[email protected]" <[email protected]>,
> "[email protected]" <[email protected]>,
> "[email protected]" <[email protected]>,
> "[email protected]" <[email protected]>,
> "[email protected]" <[email protected]>
>
> On Sun, Mar 01, 2026 at 07:39:30PM -0500, Michael S. Tsirkin wrote:
> > On Sun, Mar 01, 2026 at 07:10:06PM -0500, Michael S. Tsirkin wrote:
> > > On Sun, Mar 01, 2026 at 10:36:39PM +0000, ShuangYu wrote:
> > > > Hi,
> > > >
> > > > We have hit a severe livelock in vhost_net on 6.18.x. The vhost
> > > > kernel thread spins at 100% CPU indefinitely in handle_rx(), and
> > > > QEMU becomes unkillable (stuck in D state).
> > > >
> > > > Environment
> > > > -----------
> > > > Kernel: 6.18.10-1.el8.elrepo.x86_64
> > > > QEMU: 7.2.19
> > > > Virtio: VIRTIO_F_IN_ORDER is negotiated
> > > > Backend: vhost (kernel)
> > > >
> > > > Symptoms
> > > > --------
> > > > - vhost-<pid> kernel thread at 100% CPU (R state, never yields)
> > > > - QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM
> > > > - kill -9 has no effect on the QEMU process
> > > > - libvirt management plane deadlocks ("cannot acquire state change
> > > >   lock")
> > > >
> > > > Root Cause
> > > > ----------
> > > > The livelock is triggered when a GRO-merged packet on the host TAP
> > > > interface (e.g., ~60KB) exceeds the remaining free capacity of the
> > > > guest's RX virtqueue (e.g., ~40KB of available buffers).
> > > >
> > > > The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows:
> > > >
> > > > 1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors.
> > > >    It advances vq->last_avail_idx and vq->next_avail_head as it
> > > >    consumes buffers, but runs out before satisfying datalen.
> > > >
> > > > 2. get_rx_bufs() jumps to err: and calls
> > > >    vhost_discard_vq_desc(vq, headcount, n), which rolls back
> > > >    vq->last_avail_idx and vq->next_avail_head.
> > > >
> > > >    Critically, vq->avail_idx (the cached copy of the guest's
> > > >    avail->idx) is NOT rolled back. This is correct behavior in
> > > >    isolation, but creates a persistent mismatch:
> > > >
> > > >        vq->avail_idx      = 108  (cached, unchanged)
> > > >        vq->last_avail_idx = 104  (rolled back)
> > > >
> > > > 3. handle_rx() sees headcount == 0 and calls vhost_enable_notify().
> > > >    Inside, vhost_get_avail_idx() finds:
> > > >
> > > >        vq->avail_idx (108) != vq->last_avail_idx (104)
> > > >
> > > >    It returns 1 (true), indicating "new buffers available."
> > > >    But these are the SAME buffers that were just discarded.
> > > >
> > > > 4. handle_rx() hits `continue`, restarting the loop.
> > > >
> > > > 5. In the next iteration, vhost_get_vq_desc_n() checks:
> > > >
> > > >        if (vq->avail_idx == vq->last_avail_idx)
> > > >
> > > >    This is FALSE (108 != 104), so it skips re-reading the guest's
> > > >    actual avail->idx and directly fetches the same descriptors.
> > > >
> > > > 6. The exact same sequence repeats: fetch -> too small -> discard
> > > >    -> rollback -> "new buffers!" -> continue. Indefinitely.
> > > >
> > > > This appears to be a regression introduced by the VIRTIO_F_IN_ORDER
> > > > support, which added vhost_get_vq_desc_n() with the cached avail_idx
> > > > short-circuit check, and the two-argument vhost_discard_vq_desc()
> > > > with next_avail_head rollback. The mismatch between the rollback
> > > > scope (last_avail_idx, next_avail_head) and the check scope
> > > > (avail_idx vs last_avail_idx) was not present before this change.
> > > >
> > > > bpftrace Evidence
> > > > -----------------
> > > > During the 100% CPU lockup, we traced:
> > > >
> > > >     @get_rx_ret[0]:      4468052  // get_rx_bufs() returns 0 every time
> > > >     @peek_ret[60366]:    4385533  // same 60KB packet seen every iteration
> > > >     @sock_err[recvmsg]:  0        // tun_recvmsg() is never reached
> > > >
> > > > vhost_get_vq_desc_n() was observed iterating over the exact same 11
> > > > descriptor addresses millions of times per second.
> > > >
> > > > Workaround
> > > > ----------
> > > > Either of the following avoids the livelock:
> > > >
> > > > - Disable GRO/GSO on the TAP interface:
> > > >   ethtool -K <tap> gro off gso off
> > > >
> > > > - Switch from kernel vhost to the userspace QEMU backend:
> > > >   <driver name='qemu'/> in the libvirt XML
> > > >
> > > > Bisect
> > > > ------
> > > > We have not yet completed a full git bisect, but the issue does not
> > > > occur on 6.17.x kernels, which lack the VIRTIO_F_IN_ORDER vhost
> > > > support. We will follow up with a Fixes: tag if we can identify the
> > > > exact commit.
> > > >
> > > > Suggested Fix Direction
> > > > -----------------------
> > > > In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to
> > > > insufficient buffers (not because the queue is truly empty), the code
> > > > should break out of the loop rather than relying on
> > > > vhost_enable_notify() to make that determination. For example, when
> > > > get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a
> > > > "packet too large" condition, not a "queue empty" condition, and
> > > > should be handled differently.
> > > >
> > > > Thanks,
> > > > ShuangYu
> > >
> > > Hmm. On a hunch, does the following help? Completely untested,
> > > it is night here, sorry.
> > >
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 2f2c45d20883..aafae15d5156 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -1522,6 +1522,7 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> > >  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
> > >  {
> > >  	__virtio16 idx;
> > > +	u16 avail_idx;
> > >  	int r;
> > >  
> > >  	r = vhost_get_avail(vq, idx, &vq->avail->idx);
> > > @@ -1532,17 +1533,19 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
> > >  	}
> > >  
> > >  	/* Check it isn't doing very strange thing with available indexes */
> > > -	vq->avail_idx = vhost16_to_cpu(vq, idx);
> > > -	if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
> > > +	avail_idx = vhost16_to_cpu(vq, idx);
> > > +	if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
> > >  		vq_err(vq, "Invalid available index change from %u to %u",
> > >  			vq->last_avail_idx, vq->avail_idx);
> > >  		return -EINVAL;
> > >  	}
> > >  
> > >  	/* We're done if there is nothing new */
> > > -	if (vq->avail_idx == vq->last_avail_idx)
> > > +	if (avail_idx == vq->avail_idx)
> > >  		return 0;
> > >  
> > > +	vq->avail_idx == avail_idx;
> > > +
> >
> > meaning
> >
> > 	vq->avail_idx = avail_idx;
> >
> > of course
> >
> > >  	/*
> > >  	 * We updated vq->avail_idx so we need a memory barrier between
> > >  	 * the index read above and the caller reading avail ring
> > >  	 * entries.
>
> and the change this is fixing was done in
> d3bb267bbdcba199568f1325743d9d501dea0560
>
> --
> MST
>
Thank you for the quick fix and for identifying the root commit. I've
reviewed the patch and I believe the logic is correct: changing the
"nothing new" check in vhost_get_avail_idx() from comparing against
vq->last_avail_idx to comparing against the cached vq->avail_idx makes
it immune to the rollback done by vhost_discard_vq_desc(), which is
exactly what breaks the loop.

One minor nit: the vq_err message on the sanity-check path still
references vq->avail_idx before it has been updated:

 	vq_err(vq, "Invalid available index change from %u to %u",
-		vq->last_avail_idx, vq->avail_idx);
+		vq->last_avail_idx, avail_idx);

Since this issue was found in production, I need some time to prepare a
test setup to verify the patch.

Thanks,
ShuangYu

