On Fri, Jun 12, 2026 at 07:57:17PM +0300, Andrey Drobyshev wrote:
From: "Denis V. Lunev" <[email protected]>
Earlier commit ("ms/vhost/vsock: Refuse the connection immediately when
Please follow
https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
on how to refer to a commit.
guest isn't ready") added a fast-fail in vhost_transport_send_pkt(). It
rejects every host send with -EHOSTUNREACH until the destination calls
SET_RUNNING(1). The fast-fail condition checks whether device's backends
are dropped, and if they're, the guest is considered to be not ready.
Okay, so it's not a regression, I mean without this series that patch is
not adding any regression, no?
If it's the case, I'll change the wording in the cover letter.
However, there might be other reasons for backends to be nulled. In
particular, when QEMU is performing CPR (checkpoint-restore) migration,
device ownership is being RESET and SET again, which leads to backends
drop and reattach. If we end up connecting during this window, an
AF_VSOCK client gets -EHOSTUNREACH, which is wrong.
Please add this change before starting to support VHOST_RESET_OWNER
ioctl in vhost-vsock, otherwise we are breaking the bisectability.
Add a cpr_paused flag set inside vhost_vsock_drop_backends() when the
backend was previously live, cleared by vhost_vsock_start(). When set,
vhost_transport_send_pkt() queues the skb instead of fast-failing; the
existing kick of send_pkt_work in vhost_vsock_start() drains it on
resume. A device that has never run keeps cpr_paused == false and the
boot-time fast-fail behaviour is preserved.
Pair the cpr_paused store with the backend store using an
smp_wmb()/smp_rmb() pair so a concurrent sender on a weakly-ordered
architecture never observes (NULL backend, !paused):
Signed-off-by: Denis V. Lunev <[email protected]>
---
drivers/vhost/vsock.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e629886e5cf8..bcaba36becd7 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -61,6 +61,7 @@ struct vhost_vsock {
u32 guest_cid;
bool seqpacket_allow;
+ bool cpr_paused; /* between stop and next start */
};
static u32 vhost_transport_get_local_cid(void)
@@ -311,11 +312,17 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net
*net)
* the mutex would be too expensive in this hot path, and we already
have
* all the outcomes covered: if the backend becomes NULL right after
the check,
* vhost_transport_do_send_pkt() will check it under the mutex anyway.
+ *
+ * Don't fast-fail if cpr_paused is set, keep queueing skbs instead.
+ * The kick in vhost_vsock_start() will drain them on resume.
*/
if
(unlikely(!data_race(vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])))) {
- rcu_read_unlock();
- kfree_skb(skb);
- return -EHOSTUNREACH;
+ smp_rmb(); /* pairs with smp_wmb() in start/drop_backends
*/
+ if (!READ_ONCE(vsock->cpr_paused)) {
Can we avoid this which is not really readable and maybe add a single
variable to control the fast-fail at all?
I mean replacing both cpr_paused + backend-pointer with a single
`started` flag: set it to false at open, true on start via
smp_store_release(), back to false on normal stop, and leave it true
during CPR pause.
The reader in send_pkt can do just:
if (!smp_load_acquire(&vsock->started))
return -EHOSTUNREACH;
WDYT?
+ rcu_read_unlock();
+ kfree_skb(skb);
+ return -EHOSTUNREACH;
+ }
That said claude here is reporting a potential issue that I think we
should consider:
After VHOST_RESET_OWNER, the guest CID stays in the hash, so
vhost_transport_send_pkt() can still find the vsock, skip the
fast-fail (cpr_paused=true), and call vhost_vq_work_queue() while
vhost_workers_free() is freeing workers without a synchronize_rcu()
— risking a use-after-free. Also, any send_pkt_work queued between
the last flush and worker teardown gets its VHOST_WORK_QUEUED bit
stuck (the vhost task exits without draining), deadlocking
host→guest traffic after restart.
A synchronize_rcu() in vhost_workers_free() between the
rcu_assign_pointer(NULL) loop and the destroy loop would close the
use-after-free, and reinitializing send_pkt_work via
vhost_work_init() after vhost_dev_reset_owner() returns would clear
the stuck QUEUED bit.
}
if (virtio_vsock_skb_reply(skb))
@@ -640,6 +647,9 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
mutex_unlock(&vq->mutex);
}
+ smp_wmb(); /* pairs with smp_rmb() in send_pkt */
+ WRITE_ONCE(vsock->cpr_paused, false);
+
/* Some packets may have been queued before the device was started,
* let's kick the send worker to send them.
*/
@@ -671,6 +681,11 @@ static void vhost_vsock_drop_backends(struct vhost_vsock
*vsock)
lockdep_assert_held(&vsock->dev.mutex);
+ if (vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
+ WRITE_ONCE(vsock->cpr_paused, true);
+ smp_wmb(); /* pairs with smp_rmb() in send_pkt */
+ }
Why here and not in vhost_vsock_reset_owner()?
Also having this here will set it to true also with
VHOST_VSOCK_SET_RUNNING(0), is that right?
Thanks,
Stefano
+
for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++) {
vq = &vsock->vqs[i];
@@ -728,6 +743,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct
file *file)
vsock->guest_cid = 0; /* no CID assigned yet */
vsock->seqpacket_allow = false;
+ vsock->cpr_paused = false;
atomic_set(&vsock->queued_replies, 0);
--
2.47.1