> On 9 Oct 2025, at 4:02 PM, Daniel P. Berrangé <[email protected]> wrote: > > !-------------------------------------------------------------------| > CAUTION: External Email > > |-------------------------------------------------------------------! > > On Thu, Oct 09, 2025 at 10:14:18AM +0000, Tejus GK wrote: >> From: Manish Mishra <[email protected]> >> >> The kernel allocates extra metadata SKBs in case of a zerocopy send, >> eventually used for zerocopy's notification mechanism. This metadata >> memory is accounted for in the OPTMEM limit. The kernel queues >> completion notifications on the socket error queue and this error queue >> is freed when userspace reads it. >> >> Usually, in the case of in-order processing, the kernel will batch the >> notifications and merge the metadata into a single SKB and free the >> rest. As a result, it never exceeds the OPTMEM limit. However, if there >> is any out-of-order processing or intermittent zerocopy failures, this >> error chain can grow significantly, exhausting the OPTMEM limit. As a >> result, all new sendmsg requests fail to allocate any new SKB, leading >> to an ENOBUF error. Depending on the amount of data queued before the >> flush (i.e., large live migration iterations), even large OPTMEM limits >> are prone to failure. >> >> To work around this, if we encounter an ENOBUF error with a zerocopy >> sendmsg, flush the error queue and retry once more. >> >> Co-authored-by: Manish Mishra <[email protected]> >> Signed-off-by: Tejus GK <[email protected]> >> --- >> include/io/channel-socket.h | 5 +++ >> io/channel-socket.c | 78 ++++++++++++++++++++++++++++++------- >> migration/multifd-nocomp.c | 3 +- >> migration/multifd.c | 3 +- >> 4 files changed, 72 insertions(+), 17 deletions(-) >> >> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h >> index 26319fa98b..fcfd489c6c 100644 >> --- a/include/io/channel-socket.h >> +++ b/include/io/channel-socket.h >> @@ -50,6 +50,11 @@ struct QIOChannelSocket { >> ssize_t zero_copy_queued; >> ssize_t zero_copy_sent; >> bool blocking; >> + /** >> + * This flag indicates whether any new data was successfully sent with >> + * zerocopy since the last qio_channel_socket_flush() call. >> + */ >> + bool new_zero_copy_sent_success; >> }; >> >> >> diff --git a/io/channel-socket.c b/io/channel-socket.c >> index 8b30d5b7f7..22d59cd3cf 100644 >> --- a/io/channel-socket.c >> +++ b/io/channel-socket.c >> @@ -37,6 +37,12 @@ >> >> #define SOCKET_MAX_FDS 16 >> >> +#ifdef QEMU_MSG_ZEROCOPY >> +static int qio_channel_socket_flush_internal(QIOChannel *ioc, >> + bool block, >> + Error **errp); >> +#endif >> + >> SocketAddress * >> qio_channel_socket_get_local_address(QIOChannelSocket *ioc, >> Error **errp) >> @@ -66,6 +72,7 @@ qio_channel_socket_new(void) >> sioc->zero_copy_queued = 0; >> sioc->zero_copy_sent = 0; >> sioc->blocking = false; >> + sioc->new_zero_copy_sent_success = FALSE; >> >> ioc = QIO_CHANNEL(sioc); >> qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); >> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc, >> size_t fdsize = sizeof(int) * nfds; >> struct cmsghdr *cmsg; >> int sflags = 0; >> + bool blocking = sioc->blocking; >> + bool zerocopy_flushed_once = false; >> >> memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)); >> >> @@ -663,10 +672,26 @@ static ssize_t qio_channel_socket_writev(QIOChannel >> *ioc, >> case EINTR: >> goto retry; >> case ENOBUFS: >> - if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) { >> - error_setg_errno(errp, errno, >> - "Process can't lock enough memory for >> using MSG_ZEROCOPY"); >> - return -1; >> + if (flags & (QIO_CHANNEL_WRITE_FLAG_ZERO_COPY | >> + QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE)) { > > > There are only two callers where this patch has added use of > QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE. > > Both of them already sett QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, so > this code branch was already being taken > > IOW, this new QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE looks > pointless. >
The intention on adding QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE was to let callers decide if they want QEMU to silently flush and not throw an ENOBUFS. Right now it fits for the migration use-case, since migration is the only caller using MSG_ZEROCOPY in the first place, but in case someone else decides to use MSG_ZEROCOPY, and wants the regular semantics of a write, i.e, expecting an ENOBUFS on a socket error queue pileup, they may choose not to pass the flag, rather than letting QEMU doing it unconditionally. However, I might be overthinking this, if the usecase which I mention above doesn't make sense, I can drop this flag. regards, tejus
