On Tue, Oct 21, 2025 at 04:16:48PM +0000, Tejus GK wrote:
> 
> 
> > On 13 Oct 2025, at 3:08 PM, Daniel P. Berrangé <[email protected]> wrote:
> > 
> > 
> > On Mon, Oct 13, 2025 at 09:21:22AM +0000, Tejus GK wrote:
> >> From: Manish Mishra <[email protected]>
> >> 
> >> The kernel allocates extra metadata SKBs for a zerocopy send, which are
> >> eventually used for zerocopy's notification mechanism. This metadata
> >> memory is accounted against the OPTMEM limit. The kernel queues
> >> completion notifications on the socket error queue, and the queued
> >> metadata is freed once userspace reads the notifications.
> >> 
> >> Usually, with in-order processing, the kernel batches the notifications,
> >> merging the metadata into a single SKB and freeing the rest, so the
> >> OPTMEM limit is never exceeded. However, any out-of-order processing or
> >> intermittent zerocopy failure can grow this error chain significantly,
> >> exhausting the OPTMEM limit. All new sendmsg requests then fail to
> >> allocate any new SKB, failing with ENOBUFS. Depending on the amount of
> >> data queued before the flush (e.g., large live migration iterations),
> >> even large OPTMEM limits are prone to failure.
> >> 
> >> To work around this, if we encounter an ENOBUFS error with a zerocopy
> >> sendmsg, flush the error queue and retry once more.
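
For reference, the notification reception described above boils down to
draining the socket error queue with MSG_ERRQUEUE. A minimal illustrative
sketch of that mechanism (Linux-only; the function and names here are
illustrative, not code from this patch):

  #include <sys/socket.h>
  #include <linux/errqueue.h>

  /* Drain pending zerocopy completion notifications so the kernel can
   * release the OPTMEM-accounted metadata SKBs. */
  static void drain_zerocopy_notifications(int fd)
  {
      struct msghdr msg = { 0 };
      char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
      struct cmsghdr *cm;
      struct sock_extended_err *serr;

      msg.msg_control = control;
      msg.msg_controllen = sizeof(control);

      /* Reads from the error queue never block; -1/EAGAIN means empty. */
      while (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0) {
          cm = CMSG_FIRSTHDR(&msg);
          if (!cm) {
              continue;
          }
          serr = (struct sock_extended_err *) CMSG_DATA(cm);
          if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
              continue;
          }
          /* [ee_info, ee_data] is the (possibly batched) range of
           * completed zerocopy sendmsg() calls. */
      }
  }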
> >> 
> >> Co-authored-by: Manish Mishra <[email protected]>
> >> Signed-off-by: Tejus GK <[email protected]>
> >> ---
> >> include/io/channel-socket.h |  5 +++
> >> io/channel-socket.c         | 75 ++++++++++++++++++++++++++++++-------
> >> 2 files changed, 66 insertions(+), 14 deletions(-)
> >> 
> >> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> >> index 26319fa98b..fcfd489c6c 100644
> >> --- a/include/io/channel-socket.h
> >> +++ b/include/io/channel-socket.h
> >> @@ -50,6 +50,11 @@ struct QIOChannelSocket {
> >>     ssize_t zero_copy_queued;
> >>     ssize_t zero_copy_sent;
> >>     bool blocking;
> >> +    /**
> >> +     * This flag indicates whether any new data was successfully sent with
> >> +     * zerocopy since the last qio_channel_socket_flush() call.
> >> +     */
> >> +    bool new_zero_copy_sent_success;
> >> };
> >> 
> >> 
> >> diff --git a/io/channel-socket.c b/io/channel-socket.c
> >> index 8b30d5b7f7..7cd9f3666d 100644
> >> --- a/io/channel-socket.c
> >> +++ b/io/channel-socket.c
> >> @@ -37,6 +37,12 @@
> >> 
> >> #define SOCKET_MAX_FDS 16
> >> 
> >> +#ifdef QEMU_MSG_ZEROCOPY
> >> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> >> +                                             bool block,
> >> +                                             Error **errp);
> >> +#endif
> >> +
> >> SocketAddress *
> >> qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> >>                                      Error **errp)
> >> @@ -66,6 +72,7 @@ qio_channel_socket_new(void)
> >>     sioc->zero_copy_queued = 0;
> >>     sioc->zero_copy_sent = 0;
> >>     sioc->blocking = false;
> >> +    sioc->new_zero_copy_sent_success = false;
> >> 
> >>     ioc = QIO_CHANNEL(sioc);
> >>     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> >> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >>     size_t fdsize = sizeof(int) * nfds;
> >>     struct cmsghdr *cmsg;
> >>     int sflags = 0;
> >> +    bool blocking = sioc->blocking;
> >> +    bool zerocopy_flushed_once = false;
> >> 
> >>     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> >> 
> >> @@ -664,9 +673,24 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >>             goto retry;
> >>         case ENOBUFS:
> >>             if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> >> -                error_setg_errno(errp, errno,
> >> -                                 "Process can't lock enough memory for 
> >> using MSG_ZEROCOPY");
> >> -                return -1;
> >> +                /**
> >> +                 * Socket error queueing may exhaust the OPTMEM limit. Try
> >> +                 * flushing the error queue once.
> >> +                 */
> >> +                if (!zerocopy_flushed_once) {
> >> +                    ret = qio_channel_socket_flush_internal(ioc, blocking,
> >> +                                                            errp);
> >> +                    if (ret < 0) {
> >> +                        return -1;
> >> +                    }
> >> +                    zerocopy_flushed_once = true;
> >> +                    goto retry;
> >> +                } else {
> >> +                    error_setg_errno(errp, errno,
> >> +                                     "Process can't lock enough memory 
> >> for "
> >> +                                     "using MSG_ZEROCOPY");
> >> +                    return -1;
> >> +                }
> >>             }
> >>             break;
> >>         }
> >> @@ -777,8 +801,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >> 
> >> 
> >> #ifdef QEMU_MSG_ZEROCOPY
> >> -static int qio_channel_socket_flush(QIOChannel *ioc,
> >> -                                    Error **errp)
> >> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> >> +                                             bool block,
> >> +                                             Error **errp)
> >> {
> >>     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> >>     struct msghdr msg = {};
> >> @@ -786,7 +811,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
> >>     struct cmsghdr *cm;
> >>     char control[CMSG_SPACE(sizeof(*serr))];
> >>     int received;
> >> -    int ret;
> >> 
> >>     if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
> >>         return 0;
> >> @@ -796,16 +820,20 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
> >>     msg.msg_controllen = sizeof(control);
> >>     memset(control, 0, sizeof(control));
> >> 
> >> -    ret = 1;
> >> -
> >>     while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
> >>         received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
> >>         if (received < 0) {
> >>             switch (errno) {
> >>             case EAGAIN:
> >> -                /* Nothing on errqueue, wait until something is available */
> >> -                qio_channel_wait(ioc, G_IO_ERR);
> >> -                continue;
> >> +                if (block) {
> >> +                    /*
> >> +                     * Nothing on errqueue, wait until something is
> >> +                     * available.
> >> +                     */
> >> +                    qio_channel_wait(ioc, G_IO_ERR);
> >> +                    continue;
> > 
> > Why G_IO_ERR ?  If we're waiting for recvmsg() to become ready, then
> > it would need to be G_IO_IN we're waiting for.
> 
> Apologies for the delayed response. Please correct me if I am wrong:
> https://docs.kernel.org/networking/msg_zerocopy.html#notification-reception
> says that to poll for notifications on the socket error queue, one need
> not set any flag in the events field; when a message eventually lands on
> the notification queue, the kernel sets POLLERR in the output.
> From what I understand, the glib equivalent of POLLERR is G_IO_ERR, so
> we'd be waiting on the socket error queue until a notification comes up.
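
In plain poll() terms, the behaviour described above looks roughly like
this; an illustrative sketch, not code from the patch:

  #include <poll.h>

  struct pollfd pfd = { .fd = fd, .events = 0 };

  /* No input events requested: the kernel reports pending error-queue
   * messages via POLLERR in revents regardless of what events asks for. */
  if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLERR)) {
      /* a notification can now be read with
       * recvmsg(fd, &msg, MSG_ERRQUEUE) */
  }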

Ah I see. That's rather non-obvious. Can you put a comment in the code
to the effect that use of MSG_ERRQUEUE requires you to poll on G_IO_ERR
instead of the normal G_IO_IN?
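
Something along these lines above the wait would capture it (wording
purely illustrative):

  /*
   * Completion notifications for MSG_ZEROCOPY sit on the socket error
   * queue and are signalled by the kernel via POLLERR, not POLLIN, so
   * wait on G_IO_ERR here rather than the usual G_IO_IN.
   */
  qio_channel_wait(ioc, G_IO_ERR);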


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

