On Tue, Oct 21, 2025 at 04:16:48PM +0000, Tejus GK wrote: > > > > On 13 Oct 2025, at 3:08 PM, Daniel P. Berrangé <[email protected]> wrote: > > > > !-------------------------------------------------------------------| > > CAUTION: External Email > > > > |-------------------------------------------------------------------! > > > > On Mon, Oct 13, 2025 at 09:21:22AM +0000, Tejus GK wrote: > >> From: Manish Mishra <[email protected]> > >> > >> The kernel allocates extra metadata SKBs in case of a zerocopy send, > >> eventually used for zerocopy's notification mechanism. This metadata > >> memory is accounted for in the OPTMEM limit. The kernel queues > >> completion notifications on the socket error queue and this error queue > >> is freed when userspace reads it. > >> > >> Usually, in the case of in-order processing, the kernel will batch the > >> notifications and merge the metadata into a single SKB and free the > >> rest. As a result, it never exceeds the OPTMEM limit. However, if there > >> is any out-of-order processing or intermittent zerocopy failures, this > >> error chain can grow significantly, exhausting the OPTMEM limit. As a > >> result, all new sendmsg requests fail to allocate any new SKB, leading > >> to an ENOBUF error. Depending on the amount of data queued before the > >> flush (i.e., large live migration iterations), even large OPTMEM limits > >> are prone to failure. > >> > >> To work around this, if we encounter an ENOBUF error with a zerocopy > >> sendmsg, flush the error queue and retry once more. > >> > >> Co-authored-by: Manish Mishra <[email protected]> > >> Signed-off-by: Tejus GK <[email protected]> > >> --- > >> include/io/channel-socket.h | 5 +++ > >> io/channel-socket.c | 75 ++++++++++++++++++++++++++++++------- > >> 2 files changed, 66 insertions(+), 14 deletions(-) > >> > >> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h > >> index 26319fa98b..fcfd489c6c 100644 > >> --- a/include/io/channel-socket.h > >> +++ b/include/io/channel-socket.h > >> @@ -50,6 +50,11 @@ struct QIOChannelSocket { > >> ssize_t zero_copy_queued; > >> ssize_t zero_copy_sent; > >> bool blocking; > >> + /** > >> + * This flag indicates whether any new data was successfully sent with > >> + * zerocopy since the last qio_channel_socket_flush() call. > >> + */ > >> + bool new_zero_copy_sent_success; > >> }; > >> > >> > >> diff --git a/io/channel-socket.c b/io/channel-socket.c > >> index 8b30d5b7f7..7cd9f3666d 100644 > >> --- a/io/channel-socket.c > >> +++ b/io/channel-socket.c > >> @@ -37,6 +37,12 @@ > >> > >> #define SOCKET_MAX_FDS 16 > >> > >> +#ifdef QEMU_MSG_ZEROCOPY > >> +static int qio_channel_socket_flush_internal(QIOChannel *ioc, > >> + bool block, > >> + Error **errp); > >> +#endif > >> + > >> SocketAddress * > >> qio_channel_socket_get_local_address(QIOChannelSocket *ioc, > >> Error **errp) > >> @@ -66,6 +72,7 @@ qio_channel_socket_new(void) > >> sioc->zero_copy_queued = 0; > >> sioc->zero_copy_sent = 0; > >> sioc->blocking = false; > >> + sioc->new_zero_copy_sent_success = FALSE; > >> > >> ioc = QIO_CHANNEL(sioc); > >> qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); > >> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel > >> *ioc, > >> size_t fdsize = sizeof(int) * nfds; > >> struct cmsghdr *cmsg; > >> int sflags = 0; > >> + bool blocking = sioc->blocking; > >> + bool zerocopy_flushed_once = false; > >> > >> memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)); > >> > >> @@ -664,9 +673,24 @@ static ssize_t qio_channel_socket_writev(QIOChannel > >> *ioc, > >> goto retry; > >> case ENOBUFS: > >> if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) { > >> - error_setg_errno(errp, errno, > >> - "Process can't lock enough memory for > >> using MSG_ZEROCOPY"); > >> - return -1; > >> + /** > >> + * Socket error queueing may exhaust the OPTMEM limit. Try > >> + * flushing the error queue once. > >> + */ > >> + if (!zerocopy_flushed_once) { > >> + ret = qio_channel_socket_flush_internal(ioc, blocking, > >> + errp); > >> + if (ret < 0) { > >> + return -1; > >> + } > >> + zerocopy_flushed_once = TRUE; > >> + goto retry; > >> + } else { > >> + error_setg_errno(errp, errno, > >> + "Process can't lock enough memory > >> for " > >> + "using MSG_ZEROCOPY"); > >> + return -1; > >> + } > >> } > >> break; > >> } > >> @@ -777,8 +801,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel > >> *ioc, > >> > >> > >> #ifdef QEMU_MSG_ZEROCOPY > >> -static int qio_channel_socket_flush(QIOChannel *ioc, > >> - Error **errp) > >> +static int qio_channel_socket_flush_internal(QIOChannel *ioc, > >> + bool block, > >> + Error **errp) > >> { > >> QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc); > >> struct msghdr msg = {}; > >> @@ -786,7 +811,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc, > >> struct cmsghdr *cm; > >> char control[CMSG_SPACE(sizeof(*serr))]; > >> int received; > >> - int ret; > >> > >> if (sioc->zero_copy_queued == sioc->zero_copy_sent) { > >> return 0; > >> @@ -796,16 +820,20 @@ static int qio_channel_socket_flush(QIOChannel *ioc, > >> msg.msg_controllen = sizeof(control); > >> memset(control, 0, sizeof(control)); > >> > >> - ret = 1; > >> - > >> while (sioc->zero_copy_sent < sioc->zero_copy_queued) { > >> received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE); > >> if (received < 0) { > >> switch (errno) { > >> case EAGAIN: > >> - /* Nothing on errqueue, wait until something is available > >> */ > >> - qio_channel_wait(ioc, G_IO_ERR); > >> - continue; > >> + if (block) { > >> + /* > >> + * Nothing on errqueue, wait until something is > >> + * available. > >> + */ > >> + qio_channel_wait(ioc, G_IO_ERR); > >> + continue; > > > > Why G_IO_ERR ? If we're waiting for recvmsg() to become ready, then > > it would need to be G_IO_IN we're waiting for. > > Apologies for the delayed response. Please correct me if I am wrong, > https://docs.kernel.org/networking/msg_zerocopy.html#notification-reception > mentions, that in order to poll for notifications on the socket error queue, > one need not set any flag in the events field, and the kernel in return would > set POLLERR in the output, when there’s eventually a message in the > notification queue. > From what I understand, the glib equivalent for POLLERR, is G_IO_ERR, which > means we’d be waiting on the socket error queue until a notification comes up.
Ah I see. That's rather non-obvious. Can you put a comment in the code to the effect that use of MSG_ERRQUEUE requires to you poll on G_IO_ERR instead of the normal G_IO_IN. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
