On Tue, 19 May 2026 at 17:36, Stefano Garzarella <[email protected]> wrote:
>
> On Mon, May 18, 2026 at 08:39:37PM +0800, ziyu zhang wrote:
> >On Mon, 18 May 2026 at 18:16, Stefano Garzarella <[email protected]> wrote:
> >>
> >> On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote:
> >> >vsock_poll() reads vsk->peer_shutdown before taking the socket
> >> >lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the
> >> >lock to report EOF readability. A shutdown packet can update
> >> >peer_shutdown while poll is waiting for the lock, so one poll invocation
> >> >can report EPOLLIN without the corresponding HUP/RDHUP bits.
> >> >
> >> >Keep non-connectible sockets on a single lockless READ_ONCE()
> >>
> >> Should this be paired with WRITE_ONCE() on writers?
> >
> >Yes, since the poll path uses lockless READ_ONCE() snapshots of
> >peer_shutdown, the writer side should be annotated with WRITE_ONCE() as
> >well. I will add that in v2.
> >
> >>
> >> >snapshot. For connectible sockets, defer shutdown-derived poll bits
> >> >until after lock_sock() and use one READ_ONCE() snapshot for both EOF
> >> >readability and HUP/RDHUP. This preserves shutdowns that arrive before
> >> >the lock is acquired and keeps all peer-shutdown-derived bits consistent
> >> >for a poll pass.
> >> >
> >> >Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
> >> >Signed-off-by: Ziyu Zhang <[email protected]>
> >> >---
> >> > net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++--------------
> >> > 1 file changed, 26 insertions(+), 14 deletions(-)
> >> >
> >> >diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> >> >index adcba1b7b..bed42347b 100644
> >> >--- a/net/vmw_vsock/af_vsock.c
> >> >+++ b/net/vmw_vsock/af_vsock.c
> >> >@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int 
> >> >mode)
> >> >       return err;
> >> > }
> >> >
> >> >+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown)
> >> >+{
> >> >+      __poll_t mask = 0;
> >> >+
> >> >+      /* INET sockets treat local write shutdown and peer write shutdown 
> >> >as a
> >> >+       * case of EPOLLHUP set.
> >> >+       */
> >> >+      if (sk->sk_shutdown == SHUTDOWN_MASK ||
> >> >+          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >> >+           (peer_shutdown & SEND_SHUTDOWN)))
> >> >+              mask |= EPOLLHUP;
> >> >+
> >> >+      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >> >+          peer_shutdown & SEND_SHUTDOWN)
> >> >+              mask |= EPOLLRDHUP;
> >> >+
> >> >+      return mask;
> >> >+}
> >> >+
> >> > static __poll_t vsock_poll(struct file *file, struct socket *sock,
> >> >                              poll_table *wait)
> >> > {
> >> >@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, 
> >> >struct socket *sock,
> >> >               /* Signify that there has been an error on this socket. */
> >> >               mask |= EPOLLERR;
> >> >
> >> >-      /* INET sockets treat local write shutdown and peer write shutdown 
> >> >as a
> >> >-       * case of EPOLLHUP set.
> >> >-       */
> >> >-      if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
> >> >-          ((sk->sk_shutdown & SEND_SHUTDOWN) &&
> >> >-           (vsk->peer_shutdown & SEND_SHUTDOWN))) {
> >> >-              mask |= EPOLLHUP;
> >> >-      }
> >> >-
> >> >-      if (sk->sk_shutdown & RCV_SHUTDOWN ||
> >> >-          vsk->peer_shutdown & SEND_SHUTDOWN) {
> >> >-              mask |= EPOLLRDHUP;
> >> >-      }
> >> >+      if (!sock_type_connectible(sk->sk_type))
> >> >+              mask |= vsock_poll_shutdown(sk,
> >> >+                                          READ_ONCE(vsk->peer_shutdown));
> >>
> >> Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ?
> >>
> >> Not a strong opinion about that, but in any case IMO we should add a
> >> comment here to explain why we are doing only for not connectible
> >> sockets.
> >>
> >> That said, if we use WRITE_ONCE in the writers, do we really need to
> >> move this after the lock_sock for the connectable ones?
> >
> >Yes, I will move the non-connectible handling into the SOCK_DGRAM branch
> >and add a comment there.
> >
> >For connectible sockets, my current understanding is that
> >READ_ONCE()/WRITE_ONCE() make the individual lockless accesses explicit, but
> >they do not make two separate reads in one vsock_poll() invocation observe 
> >the
> >same peer_shutdown value. So I still think using one peer_shutdown snapshot
> >after lock_sock() is useful for keeping the returned mask internally
> >consistent. Please let me know if you think WRITE_ONCE() is enough for this
> >case.
>
> What will be the issue of "do not make two separate reads in one
> vsock_poll() invocation observe the same peer_shutdown value." ?
>
> In any case, I'm not against it; I just want to understand the issue
> better.

The issue I was trying to avoid is a transiently inconsistent poll mask
from one vsock_poll() pass.

For example, the early peer_shutdown read can see 0, so the
shutdown-derived bits are not set, especially EPOLLRDHUP, and EPOLLHUP
in the cases where the existing logic would set it. Then, before
lock_sock() succeeds, the peer shutdown can be processed. The later read
after lock_sock() can see SEND_SHUTDOWN and set EPOLLIN|EPOLLRDNORM for
EOF readability.

So the returned mask can say that EOF is readable, but miss the
shutdown-derived indication from the same peer_shutdown state. If
userspace is waiting specifically for EPOLLRDHUP, that notification can
be missed or delayed for that poll pass.

I agree this is small and transient. A following read() or another poll
pass will observe the shutdown state. My intention with the single
snapshot is only to avoid mixing old and new peer_shutdown values in one
returned mask.

I will send a v2 as a new thread with WRITE_ONCE() on the writers, the
DGRAM comment, and your ordering suggestion.

Thanks,
Ziyu

>
> Thanks,
> Stefano
>

Reply via email to