On Mon, 18 May 2026 at 18:16, Stefano Garzarella <[email protected]> wrote:
>
> On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote:
> >vsock_poll() reads vsk->peer_shutdown before taking the socket
> >lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
> >lock to report EOF readability. A shutdown packet can update
> >peer_shutdown while poll is waiting for the lock, so one poll invocation
> >can report EPOLLIN without the corresponding HUP/RDHUP bits.
> >
> >Keep non-connectible sockets on a single lockless READ_ONCE()
>
> Should this be paired with WRITE_ONCE() on writers?

Yes, since the poll path uses lockless READ_ONCE() snapshots of
peer_shutdown, the writer side should be annotated with WRITE_ONCE() as
well. I will add that in v2.

>
> >snapshot. For connectible sockets, defer shutdown-derived poll bits
> >until after lock_sock() and use one READ_ONCE() snapshot for both EOF
> >readability and HUP/RDHUP. This preserves shutdowns that arrive before
> >the lock is acquired and keeps all peer-shutdown-derived bits consistent
> >for a poll pass.
> >
> >Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
> >Signed-off-by: Ziyu Zhang <[email protected]>
> >---
> > net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
> > 1 file changed, 26 insertions(+), 14 deletions(-)
> >
> >diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >index adcba1b7b..bed42347b 100644
> >--- a/net/vmw_vsock/af_vsock.c
> >+++ b/net/vmw_vsock/af_vsock.c
> >@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int 
> >mode)
> >       return err;
> > }
> >
> >+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
> >+{
> >+      __poll_t mask = 0;
> >+
> >+      /* INET sockets treat local write shutdown and peer write shutdown as 
> >a
> >+       * case of EPOLLHUP set.
> >+       */
> >+      if (sk->sk_shutdown == SHUTDOWN_MASK ||
> >+          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >+           (peer_shutdown & SEND_SHUTDOWN)))
> >+              mask |= EPOLLHUP;
> >+
> >+      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >+          peer_shutdown & SEND_SHUTDOWN)
> >+              mask |= EPOLLRDHUP;
> >+
> >+      return mask;
> >+}
> >+
> > static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >                              poll_table *wait)
> > {
> >@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, struct 
> >socket *sock,
> >               /* Signify that there has been an error on this socket. */
> >               mask |= EPOLLERR;
> >
> >-      /* INET sockets treat local write shutdown and peer write shutdown as 
> >a
> >-       * case of EPOLLHUP set.
> >-       */
> >-      if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
> >-          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >-           (vsk->peer_shutdown & SEND_SHUTDOWN))) {
> >-              mask |= EPOLLHUP;
> >-      }
> >-
> >-      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >-          vsk->peer_shutdown & SEND_SHUTDOWN) {
> >-              mask |= EPOLLRDHUP;
> >-      }
> >+      if (!sock_type_connectible(sk->sk_type))
> >+              mask |= vsock_poll_shutdown(sk,
> >+                                          READ_ONCE(vsk->peer_shutdown));
>
> Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ?
>
> Not a strong opinion about that, but in any case IMO we should add a
> comment here to explain why we are doing only for not connectible
> sockets.
>
> That said, if we use WRITE_ONCE in the writers, do we really need to
> move this after the lock_sock for the connectable ones?

Yes, I will move the non-connectible handling into the SOCK_DGRAM branch
and add a comment there.

For connectible sockets, my current understanding is that
READ_ONCE()/WRITE_ONCE() make the individual lockless accesses explicit, but
they do not make two separate reads in one vsock_poll() invocation observe the
same peer_shutdown value. So I still think using one peer_shutdown snapshot
after lock_sock() is useful for keeping the returned mask internally
consistent. Please let me know if you think WRITE_ONCE() is enough for this
case.

>
> >
> >       if (sk_is_readable(sk))
> >               mask |= EPOLLIN | EPOLLRDNORM;
> >@@ -1171,6 +1180,7 @@ static __poll_t vsock_poll(struct file *file, struct 
> >socket *sock,
> >
> >       } else if (sock_type_connectible(sk->sk_type)) {
> >               const struct vsock_transport *transport;
> >+              u32 peer_shutdown;
> >
> >               lock_sock(sk);
> >
> >@@ -1203,10 +1213,12 @@ static __poll_t vsock_poll(struct file *file, struct 
> >socket *sock,
> >                * terminated should also be considered read, and we check the
> >                * shutdown flag for that.
> >                */
> >+              peer_shutdown = READ_ONCE(vsk->peer_shutdown);
> >               if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >-                  vsk->peer_shutdown & SEND_SHUTDOWN) {
> >+                  peer_shutdown & SEND_SHUTDOWN) {
> >                       mask |= EPOLLIN | EPOLLRDNORM;
> >               }
> >+              mask |= vsock_poll_shutdown(sk, peer_shutdown);
>
> nit: to keep the order the same as before, I would move this call just
> before this `if` block, but I don't think it makes any difference in the
> end.

Sure, I will restore the previous ordering in v2.

Thanks,
Ziyu

>
> >
> >               /* Connected sockets that can produce data can be written. */
> >               if (transport && sk->sk_state == TCP_ESTABLISHED) {
> >--
> >2.43.0
> >
>

Reply via email to