On 18/05/2026 14:08, Stefano Garzarella wrote:
> On Mon, May 18, 2026 at 11:50:05AM +0100, David Laight wrote:
>> On Mon, 18 May 2026 11:54:19 +0200
>> Stefano Garzarella <[email protected]> wrote:
>>
>>> On Mon, May 18, 2026 at 05:33:08AM -0400, Michael S. Tsirkin wrote:
>>> >On Mon, May 18, 2026 at 11:18:24AM +0200, Stefano Garzarella wrote:
>>> >> On Sat, May 16, 2026 at 12:53:29PM +0100, David Laight wrote:
>>> >> > On Thu, 14 May 2026 11:29:48 +0200
>>> >> > Stefano Garzarella <[email protected]> wrote:
>>> >> >
>>> >> > > From: Stefano Garzarella <[email protected]>
>>> >> > >
>>> >> > > When a large message is fragmented into multiple skbs, the zerocopy
>>> >> > > uarg is only allocated and attached to the last skb in the loop.
>>> >> > > Non-final skbs carry pinned user pages with no completion tracking,
>>> >> > > so the kernel has no way to notify userspace when those pages are
>>> >> > > safe
>>> >> > > to reuse. If the loop breaks early the uarg is never allocated at
>>> >> > > all,
>>> >> > > leaking pinned pages with no completion notification.
>>> >> > >
>>> >> > > Fix this by following the approach used by TCP: allocate the zerocopy
>>> >> > > uarg (if not provided by the caller) before the send loop and attach
>>> >> > > it to every skb via skb_zcopy_set(), which takes a reference per skb.
>>> >> > > Each skb's completion properly decrements the refcount, and the
>>> >> > > notification only fires after the last skb is freed.
>>> >> > > On failure, if no data was sent, the uarg is cleanly aborted via
>>> >> > > net_zcopy_put_abort().
>>> >> > >
>>> >> > > This issue was initially discovered by sashiko while reviewing commit
>>> >> > > 1cb36e252211 ("vsock/virtio: fix MSG_ZEROCOPY pinned-pages
>>> >> > > accounting")
>>> >> > > but was pre-existing.
>>> >> > >
>>> >> > > Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
>>> >> > > Cc: Arseniy Krasnov <[email protected]>
>>> >> > > Closes:
>>> >> > > https://sashiko.dev/#/patchset/20260420132051.217589-1-sgarzare%40redhat.com
>>> >> > > Reported-by: Maher Azzouzi <[email protected]>
>>> >> > > Signed-off-by: Stefano Garzarella <[email protected]>
>>> >> > > ---
>>> >> > > net/vmw_vsock/virtio_transport_common.c | 83
>>> >> > >++++++++++---------------
>>> >> > > 1 file changed, 34 insertions(+), 49 deletions(-)
>>> >> > >
>>> >> > > diff --git a/net/vmw_vsock/virtio_transport_common.c
>>> >> > > b/net/vmw_vsock/virtio_transport_common.c
>>> >> > > index 989cc252d3d3..1e3409d28164 100644
>>> >> > > --- a/net/vmw_vsock/virtio_transport_common.c
>>> >> > > +++ b/net/vmw_vsock/virtio_transport_common.c
>>> >> > > @@ -70,34 +70,6 @@ static bool virtio_transport_can_zcopy(const
>>> >> > > struct virtio_transport *t_ops,
>>> >> > > return true;
>>> >> > > }
>>> >> > >
>>> >> > > -static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>> >> > > - struct sk_buff *skb,
>>> >> > > - struct msghdr *msg,
>>> >> > > - size_t pkt_len,
>>> >> > > - bool zerocopy)
>>> >> > > -{
>>> >> > > - struct ubuf_info *uarg;
>>> >> > > -
>>> >> > > - if (msg->msg_ubuf) {
>>> >> > > - uarg = msg->msg_ubuf;
>>> >> > > - net_zcopy_get(uarg);
>>> >> > > - } else {
>>> >> > > - struct ubuf_info_msgzc *uarg_zc;
>>> >> > > -
>>> >> > > - uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> >> > > - pkt_len, NULL, false);
>>> >> > > - if (!uarg)
>>> >> > > - return -1;
>>> >> > > -
>>> >> > > - uarg_zc = uarg_to_msgzc(uarg);
>>> >> > > - uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>> >> > > - }
>>> >> > > -
>>> >> > > - skb_zcopy_init(skb, uarg);
>>> >> > > -
>>> >> > > - return 0;
>>> >> > > -}
>>> >> > > -
>>> >> > > static int virtio_transport_fill_skb(struct sk_buff *skb,
>>> >> > > struct virtio_vsock_pkt_info *info,
>>> >> > > size_t len,
>>> >> > > @@ -317,8 +289,10 @@ static int
>>> >> > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>> >> > > u32 src_cid, src_port, dst_cid, dst_port;
>>> >> > > const struct virtio_transport *t_ops;
>>> >> > > struct virtio_vsock_sock *vvs;
>>> >> > > + struct ubuf_info *uarg = NULL;
>>> >> > > u32 pkt_len = info->pkt_len;
>>> >> > > bool can_zcopy = false;
>>> >> > > + bool have_uref = false;
>>> >> > > u32 rest_len;
>>> >> > > int ret;
>>> >> > >
>>> >> > > @@ -360,6 +334,25 @@ static int
>>> >> > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>> >> > > if (can_zcopy)
>>> >> > > max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>> >> > > (MAX_SKB_FRAGS * PAGE_SIZE));
>>> >> > > +
>>> >> > > + if (info->msg->msg_flags & MSG_ZEROCOPY &&
>>> >> > > + info->op == VIRTIO_VSOCK_OP_RW) {
>>> >> > > + uarg = info->msg->msg_ubuf;
>>> >> > > +
>>> >> > > + if (!uarg) {
>>> >> > > + uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> >> > > + pkt_len, NULL, false);
>>> >> > > + if (!uarg) {
>>> >> > > + virtio_transport_put_credit(vvs, pkt_len);
>>> >> > > + return -ENOMEM;
>>> >> > > + }
>>> >> > > +
>>> >> > > + if (!can_zcopy)
>>> >> > > + uarg_to_msgzc(uarg)->zerocopy = 0;
>>> >> > > +
>>> >> > > + have_uref = true;
>>> >> > > + }
>>> >> > > + }
>>> >> >
Hi guys! Just some replies after quick look:
>>> >> > Surely that block should only be done if can_zcopy is true?
I guess no, since TCP also allocates uarg even if zerocopy is impossible -
it just sets uarg_to_msgzc(uarg)->zerocopy = 0;
>>> >> > And shouldn't something unset it if info->op != VIRTIO_VSOCK_OP_RW ?
Hm, we can't enter block 'if (info->msg) {' when 'info->op' is not equal to
'VIRTIO_VSOCK_OP_RW',
because 'msg' is not NULL only for 'VIRTIO_VSOCK_OP_RW'. But anyway You point
to right thing - check for
'VIRTIO_VSOCK_OP_RW' could be removed here. Just with comment why.
>>> >> > If the msg_zerocopy_realloc() fails then can't you just set can_zcopy
>>> >> > to false.
Here I guess it is better to follow TCP way to make same behaviour, because
exact
logic for MSG_ZEROCOPY is not documented. TCP also returns error and stops tx
loop.
>>> >> >
>>> >> > It info->msg->msg_buf is already set then I think you have to disable
>>> >> > zero-copy.
>>> >> > The caller has already requested a callback - and you can't add
>>> >> > another.
In TCP implementation if 'msg_ubuf' is set they just use it and check for
zerocopy in
the same way as 'msg_ubuf' is NULL.
>>> >> >
>>> >> > In any case by the end of this can_zcopy and have_uref are really the
>>> >> > same flag.
>>> >>
Need to check it more. But 'can_zcopy' means that we fill frags in skb,
have_uref means that we
allocated completion (but it could be reported with not set
'SO_EE_CODE_ZEROCOPY_COPIED' if
'can_zcopy' was false).
@Stefano, I guess current implementation differs from TCP in two cases (at
least from first
view):
1) When 'msg_ubuf' is set: in TCP, already set 'msg_ubuf' is passed to
'skb_zerocopy_iter_stream()'
(if zerocopy is possible) where it is used as in vsock in call
'skb_zcopy_set()'. In vsock case if
'msg_ubuf' is not NULL we will pass just NULL to 'skb_zcopy_set()'. Yes this
is will be
no-effect call today (due to checks in 'skb_zcopy_set()'), but anyway - in
future may be not.
2) Also i see that 'skb_zerocopy_iter_stream()' in TCP version has some extra
checks which are
missed in vsock - we only just call '__zerocopy_sg_from_iter()' to fill skb
in zerocopy way.
But, I think, instead of trying to compare vsock and TCP versions best way is
to just copy
current TCP flow as close as possible:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/ipv4/tcp.c#n1144
1) Use same flow of checks for 'msg_flags', 'msg_ubuf', SOCK_ZEROCOPY etc.
2) To copy data we can use 'skb_zerocopy_iter_stream()'.
3) The only thing that we don't need in vsock is dev mem related code from TCP
implementation.
I can take this task, but pls need some time - may be two/three weeks due to
another tasks.
What do You think?
Thanks
>>> >> I kept the same approach we had before, trying to make as few changes as
>>> >> possible.
>>> >>
>>> >> All these potential issues seem to be pre-existing and should be
>>> >> eventually
>>> >> addressed in other patches IMHO. This patch one only resolves the main
>>> >> issue
>>> >> of calling `skb_zcopy_set()` for every skb to avoid leaking pages, etc.
>>> >
>>> >the patch is upstream now, right? So pretty much have to be patches on
>>> >top.
>>>
>>> If those are actual issues, then yes. TBH, I didn’t look into that
>>> aspect and left it as it was before. We should take a closer look at how
>>> MSG_ZEROCOPY is handled.
>>>
>>> David, if you think it needs fixing and you have time, feel free to send
>>> patches on top.
>>
>> I'm not fully sure how it all works.
>
> Same here, so I pinged Arseniy who worked on that, since it seemed deliberate
> to have `can_zcopy` (and set `uarg->zerocopy` accordingly) only when it was
> supported by the transport.
>
>> Especially the paths where msg->msg_ubuf is non-NULL, I suspect it should
>> be added to all the skb even if the ZEROCOPY flag isn't set.
>> I was just reading the one function.
>> But there did look like some very dodgy conditionals.
>
> I see, let's wait for Arseniy's feedback; otherwise, I'll try to fix it next
> week. As mentioned, this issue existed before this patch, so it shouldn't be
> a regression.
>
> Thanks,
> Stefano
>