On 5/7/26 22:13, Polina Vishneva wrote:
From: "Denis V. Lunev" <[email protected]>

When the host initiates an AF_VSOCK connect() to a guest that has not
yet loaded the virtio-vsock transport (i.e. still booting), the caller
blocks for VSOCK_DEFAULT_CONNECT_TIMEOUT (2 seconds), because
vhost_transport_do_send_pkt() silently exits when
vhost_vq_get_backend(vq) returns NULL.

If the guest doesn't start listening within this timeout, connect()
returns ETIMEDOUT.

This delay is usually pointless and is inconsistent with our behavior at
other initialization stages: for example, if a connection is attempted
when the guest driver is already loaded but nothing is listening yet,
connect() returns ECONNRESET immediately, without waiting.

Fix this by checking the RX virtqueue backend in
vhost_transport_send_pkt() before queuing. If the backend is NULL,
return -ECONNREFUSED immediately.

Signed-off-by: Denis V. Lunev <[email protected]>
Co-authored-by: Polina Vishneva <[email protected]>
Signed-off-by: Polina Vishneva <[email protected]>
---
  drivers/vhost/vsock.c | 17 ++++++++++++++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 1d8ec6bed53e..e6de1e23121b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -302,6 +302,20 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
                return -ENODEV;
        }
+       /* If the guest has not yet initialized the RX virtqueue, fail
+        * immediately rather than queueing the packet and letting the
+        * caller wait for VSOCK_DEFAULT_CONNECT_TIMEOUT.
+        *
+        * Reading private_data without vq->mutex is a deliberate racy
+        * check: if the backend is NULL the guest driver is definitely
+        * not ready; if it becomes NULL right after, the worker
+        * (do_send_pkt) rechecks under the mutex. */
+       if (!READ_ONCE(vsock->vqs[VSOCK_VQ_RX].private_data)) {
+               rcu_read_unlock();
+               kfree_skb(skb);
+               return -ECONNREFUSED;

I'm a bit hesitant about the error code returned here.
Who eventually receives this error code, and how is it handled?

I mean, we are in the middle of a VM start and the guest has not been fully
initialized yet. But we believe it will be initialized soon, so I'd expect the
attempt to be repeated after a while.

On the other hand, I'm not sure that a process that gets -ECONNREFUSED will
actually retry.

Maybe use -EAGAIN here: that error code clearly signals that a new attempt is
expected.

AI also suggests -EHOSTUNREACH (and by the way, AI does not recommend EAGAIN,
he-he :))) ).

  EHOSTUNREACH as the error code for "guest transport not ready"

  Semantics: EHOSTUNREACH means "the destination host cannot be reached" - the
  peer exists conceptually but the communication path to it is currently
  unavailable. This maps precisely to the situation: the guest VM exists, QEMU
  has opened the vhost-vsock device and assigned a CID, but the guest has not
  yet loaded its virtio-vsock driver, so the transport path is not established.

  Existing usage in vsock subsystem:

  • vmci_transport.c:95 - VMCI_ERROR_INVALID_RESOURCE is mapped to
    EHOSTUNREACH. This is the case where the VMCI endpoint for the peer cannot
    be located - the peer's transport resource does not exist yet or has been
    destroyed.

  • vmci_transport_notify.c:436,525 - returned when send_waiting_read() /
    send_waiting_write() fails, meaning the notification could not reach the
    peer. The peer is considered unreachable.

  Both cases share the same pattern: the peer is known to exist (has a CID, was
  previously connected, etc.) but the transport layer cannot deliver data to it
  right now.

  Why it fits better than ECONNREFUSED:

  • ECONNREFUSED implies the peer received the request and actively rejected it
    (e.g., nothing listening on that port). Here the guest never sees the
    request at all - the virtqueue backend is NULL, so the packet cannot even
    enter the guest.

  • EHOSTUNREACH implies the packet could not be routed/delivered to the
    destination. This is exactly what happens - the RX virtqueue has no
    backend, so delivery is impossible.

  Userspace behavior:

  • Programs and retry frameworks commonly treat EHOSTUNREACH as a transient
    condition worth retrying (the host may come up), whereas ECONNREFUSED is
    typically treated as "service does not exist at this address" and not
    retried.

  • For the specific use case (host connecting to a guest that is still
    booting), retry is the correct behavior - the guest will eventually load
    its driver and become reachable.

  It is a standard connect() error code - unlike EAGAIN, which is not expected
  from connect() and would confuse most userspace socket code.
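
To make the userspace concern concrete, here is a minimal sketch of how a
host-side caller might decide whether to retry a failed connect(). The names
vsock_connect_should_retry(), connect_with_retry() and try_connect_fn are
hypothetical, and the retry policy is an assumption for illustration only; it
is not taken from any existing program.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical policy, for illustration only: which connect() errors a
 * host-side caller might treat as "guest not ready yet, try again". */
bool vsock_connect_should_retry(int err)
{
	switch (err) {
	case EHOSTUNREACH:	/* path to the peer is not available yet */
	case ETIMEDOUT:		/* no reply within the connect timeout */
		return true;
	default:		/* includes ECONNREFUSED under this policy */
		return false;
	}
}

/* Hypothetical callback: one connect attempt, returns 0 or a positive errno. */
typedef int (*try_connect_fn)(void *ctx);

/* Retry up to max_attempts times; returns 0 or the last errno seen.
 * A real caller would also sleep or back off between attempts. */
int connect_with_retry(try_connect_fn try_connect, void *ctx, int max_attempts)
{
	int err = EINVAL;	/* returned if max_attempts <= 0 */

	for (int i = 0; i < max_attempts; i++) {
		err = try_connect(ctx);
		if (err == 0 || !vsock_connect_should_retry(err))
			break;
	}
	return err;
}
```

With a policy like this, a caller that sees ECONNREFUSED gives up after the
first attempt, while one that sees EHOSTUNREACH keeps trying until the guest
driver is loaded.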

+       }
+
        if (virtio_vsock_skb_reply(skb))
                atomic_inc(&vsock->queued_replies);
@@ -624,9 +638,6 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
                mutex_unlock(&vq->mutex);
        }
-       /* Some packets may have been queued before the device was started,
-        * let's kick the send worker to send them.
-        */
        vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);

I think the vhost_vq_work_queue() call should be removed here as well, not only
the comment.


  Before the patch: packets accumulate while backend is NULL

  Timeline from the QEMU/host perspective:

1. QEMU opens /dev/vhost-vsock - struct vhost_vsock is created, but virtqueue backend (private_data) is still NULL.

2. QEMU issues ioctl(VHOST_VSOCK_SET_GUEST_CID) - sets vsock->guest_cid, inserts vsock into vhost_vsock_hash. From this point vhost_vsock_get(cid) can find it.

3. Guest is still booting, virtio-vsock driver not loaded yet. But the vsock is already discoverable by CID lookup.

4. Host calls connect() - the packet gets queued but cannot be delivered:

  connect(fd, {AF_VSOCK, guest_cid, port})
    vsock_connect()                                [af_vsock.c:1650]
      transport->connect(vsk)                      [af_vsock.c:1730]
        virtio_transport_connect()                 [virtio_transport_common.c:1076]
          virtio_transport_send_pkt_info()         [virtio_transport_common.c:328]
            t_ops->send_pkt(skb, net)
              vhost_transport_send_pkt()           [vsock.c:289]
                vhost_vsock_get(dst_cid) -> found  (CID already in hash)
                virtio_vsock_skb_queue_tail()      ← PACKET QUEUED
                vhost_vq_work_queue()              ← WORKER KICKED
                return len                         ← SUCCESS (positive)

  Worker wakes up but cannot deliver:

  vhost_transport_send_pkt_work()
    vhost_transport_do_send_pkt(vsock, vq)         [vsock.c:107]
      mutex_lock(&vq->mutex)
      vhost_vq_get_backend(vq) == NULL             ← guest not ready
      goto out                                     ← PACKET STAYS IN QUEUE
      mutex_unlock(&vq->mutex)

Back in vsock_connect() - transport->connect() returned success (len > 0), so the code enters the wait loop:

      sk->sk_state = TCP_SYN_SENT;
      err = transport->connect(vsk);     → returns len (success)
      if (err < 0) goto out;             → NOT taken
      ...
      while (sk->sk_state != TCP_ESTABLISHED && ...) {
          timeout = schedule_timeout(timeout);     ← SLEEPS 2 SECONDS
          if (timeout == 0) {
              err = -ETIMEDOUT;                    ← GIVES UP
          }
      }

The guest never receives the CONNECT request (it is stuck in the queue), so no response arrives, and connect() returns ETIMEDOUT after 2 seconds.

5. Later the guest finishes booting, loads the virtio-vsock driver, negotiates virtqueues. QEMU issues ioctl(VHOST_VSOCK_SET_RUNNING, 1) which calls vhost_vsock_start():

  vhost_vsock_start()                              [vsock.c:609]
    for each vq:
      mutex_lock(&vq->mutex)
      vhost_vq_set_backend(vq, vsock)              ← backend becomes NON-NULL
      mutex_unlock(&vq->mutex)
    vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX],  ← KICKS WORKER AGAIN
                        &vsock->send_pkt_work)

Worker wakes up, now vhost_vq_get_backend(vq) != NULL, delivers the queued packet to the guest. But it is too late - connect() on the host side already timed out.

Why the kick in vhost_vsock_start() is essential here: between steps 4 and 5 nobody else will wake the worker. The kick from step 4 already fired and did nothing (backend was NULL). No new packets are coming - the only connect() caller is sleeping. Without this kick the packet would remain in the queue forever.

  ────────────────────────────────────────

  After the patch: packets no longer accumulate

  Same initial conditions - QEMU has set the CID, guest is still booting.

  Host calls connect():

  connect(fd, {AF_VSOCK, guest_cid, port})
    vsock_connect()                                [af_vsock.c:1650]
      transport->connect(vsk)                      [af_vsock.c:1730]
        virtio_transport_connect()                 [virtio_transport_common.c:1076]
          virtio_transport_send_pkt_info()         [virtio_transport_common.c:328]
            t_ops->send_pkt(skb, net)
              vhost_transport_send_pkt()           [vsock.c:289]
                vhost_vsock_get(dst_cid) -> found
                READ_ONCE(vsock->vqs[VSOCK_VQ_RX].private_data) == NULL
                kfree_skb(skb)                     ← PACKET FREED
                return -ECONNREFUSED               ← ERROR RETURNED

  The error propagates back immediately:

          virtio_transport_send_pkt_info():
            ret = t_ops->send_pkt(skb, net)  → -ECONNREFUSED
            if (ret < 0) break               → breaks out
        virtio_transport_connect() returns -ECONNREFUSED
      vsock_connect():
        err = transport->connect(vsk)        → -ECONNREFUSED
        if (err < 0) goto out                → TAKEN, skips wait loop
    connect() returns ECONNREFUSED to userspace immediately

The packet never enters send_pkt_queue. When vhost_vsock_start() runs later, the queue is guaranteed to be empty - there is nothing for the worker kick to flush.

  ────────────────────────────────────────

Summary: SET_GUEST_CID makes the vsock discoverable, SET_RUNNING actually enables the virtqueues. Between these two ioctls there is a window where packets are accepted into the queue but cannot be delivered. The kick in vhost_vsock_start() existed to drain this backlog. The patch closes the window at the entry point instead - refusing packets outright - so the backlog can never form.
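
The window described in the summary can be illustrated with a toy userspace
model. This is not kernel code: struct toy_vsock and the toy_*() functions are
hypothetical stand-ins. The model is single-threaded, so it deliberately
ignores the lockless race discussed in the patch comment and any later
stop/start cycles.

```c
#include <errno.h>
#include <stddef.h>

/* Toy model only; every name here is hypothetical. */
struct toy_vsock {
	void *backend;	/* stands in for vqs[VSOCK_VQ_RX].private_data */
	int queued;	/* stands in for the length of send_pkt_queue */
	int delivered;	/* packets handed to the guest */
};

/* Worker: delivers only when a backend is set, like do_send_pkt(). */
void toy_worker(struct toy_vsock *v)
{
	if (!v->backend)
		return;		/* packets stay in the queue */
	v->delivered += v->queued;
	v->queued = 0;
}

/* Pre-patch send path: always queues, kicks the worker, reports success. */
int toy_send_old(struct toy_vsock *v, int len)
{
	v->queued++;
	toy_worker(v);
	return len;
}

/* Post-patch send path: refuses the packet while there is no backend. */
int toy_send_new(struct toy_vsock *v, int len)
{
	if (!v->backend)
		return -ECONNREFUSED;
	v->queued++;
	toy_worker(v);
	return len;
}

/* SET_RUNNING: install the backend, then kick the worker. */
void toy_start(struct toy_vsock *v, void *backend)
{
	v->backend = backend;
	toy_worker(v);
}
```

In this model, a toy_send_old() call before toy_start() leaves one packet
queued that only the kick in toy_start() drains, whereas toy_send_new() fails
up front and leaves toy_start() with nothing to flush.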

mutex_unlock(&vsock->dev.mutex);
