> On 9 Oct 2025, at 9:40 PM, Daniel P. Berrangé <[email protected]> wrote:
>
> On Thu, Oct 09, 2025 at 11:51:51AM +0000, Tejus GK wrote:
>>
>>
>>> On 9 Oct 2025, at 4:02 PM, Daniel P. Berrangé <[email protected]> wrote:
>>>
>>> On Thu, Oct 09, 2025 at 10:14:18AM +0000, Tejus GK wrote:
>>>> From: Manish Mishra <[email protected]>
>>>>
>>>> The kernel allocates extra metadata SKBs in case of a zerocopy send,
>>>> eventually used for zerocopy's notification mechanism. This metadata
>>>> memory is accounted for in the OPTMEM limit. The kernel queues
>>>> completion notifications on the socket error queue, and this error
>>>> queue is freed when userspace reads it.
>>>>
>>>> Usually, in the case of in-order processing, the kernel will batch the
>>>> notifications, merge the metadata into a single SKB and free the
>>>> rest. As a result, it never exceeds the OPTMEM limit. However, if there
>>>> is any out-of-order processing or intermittent zerocopy failures, this
>>>> error chain can grow significantly, exhausting the OPTMEM limit. As a
>>>> result, all new sendmsg requests fail to allocate any new SKB, leading
>>>> to an ENOBUFS error. Depending on the amount of data queued before the
>>>> flush (i.e., large live migration iterations), even large OPTMEM limits
>>>> are prone to failure.
>>>>
>>>> To work around this, if we encounter an ENOBUFS error with a zerocopy
>>>> sendmsg, flush the error queue and retry once more.
>>>>
>>>> Co-authored-by: Manish Mishra <[email protected]>
>>>> Signed-off-by: Tejus GK <[email protected]>
>>>> ---
>>>>  include/io/channel-socket.h |  5 +++
>>>>  io/channel-socket.c         | 78 ++++++++++++++++++++++++++++++-------
>>>>  migration/multifd-nocomp.c  |  3 +-
>>>>  migration/multifd.c         |  3 +-
>>>>  4 files changed, 72 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
>>>> index 26319fa98b..fcfd489c6c 100644
>>>> --- a/include/io/channel-socket.h
>>>> +++ b/include/io/channel-socket.h
>>>> @@ -50,6 +50,11 @@ struct QIOChannelSocket {
>>>>      ssize_t zero_copy_queued;
>>>>      ssize_t zero_copy_sent;
>>>>      bool blocking;
>>>> +    /**
>>>> +     * This flag indicates whether any new data was successfully sent
>>>> +     * with zerocopy since the last qio_channel_socket_flush() call.
>>>> +     */
>>>> +    bool new_zero_copy_sent_success;
>>>>  };
>>>>
>>>>
>>>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>>>> index 8b30d5b7f7..22d59cd3cf 100644
>>>> --- a/io/channel-socket.c
>>>> +++ b/io/channel-socket.c
>>>> @@ -37,6 +37,12 @@
>>>>
>>>>  #define SOCKET_MAX_FDS 16
>>>>
>>>> +#ifdef QEMU_MSG_ZEROCOPY
>>>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>> +                                             bool block,
>>>> +                                             Error **errp);
>>>> +#endif
>>>> +
>>>>  SocketAddress *
>>>>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>>>>                                       Error **errp)
>>>> @@ -66,6 +72,7 @@ qio_channel_socket_new(void)
>>>>      sioc->zero_copy_queued = 0;
>>>>      sioc->zero_copy_sent = 0;
>>>>      sioc->blocking = false;
>>>> +    sioc->new_zero_copy_sent_success = FALSE;
>>>>
>>>>      ioc = QIO_CHANNEL(sioc);
>>>>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
>>>> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>>      size_t fdsize = sizeof(int) * nfds;
>>>>      struct cmsghdr *cmsg;
>>>>      int sflags = 0;
>>>> +    bool blocking = sioc->blocking;
>>>> +    bool zerocopy_flushed_once = false;
>>>>
>>>>      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>>>>
>>>> @@ -663,10 +672,26 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>>          case EINTR:
>>>>              goto retry;
>>>>          case ENOBUFS:
>>>> -            if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
>>>> -                error_setg_errno(errp, errno,
>>>> -                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
>>>> -                return -1;
>>>> +            if (flags & (QIO_CHANNEL_WRITE_FLAG_ZERO_COPY |
>>>> +                         QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE)) {
>>>
>>> There are only two callers where this patch has added use of
>>> QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE.
>>>
>>> Both of them already set QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, so
>>> this code branch was already being taken.
>>>
>>> IOW, this new QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE looks
>>> pointless.
>>
>> The intention behind adding QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE
>> was to let callers decide if they want QEMU to silently flush
>> and not throw an ENOBUFS.
>
> This doesn't make any sense. The sendmsg man page says that ENOBUFS
> is never returned on Linux ordinarily. So AFAICT, the only time
> we'll see ENOBUFS is if we used MSG_ZEROCOPY in the sendmsg
> input. The MSG_ZEROCOPY flag for sendmsg is only set if we have
> the QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag set in the qio_channel_write
> call. So again, QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE appears
> to serve no functional purpose.
>
>> Right now it fits the migration use-case, since migration is
>> the only caller using MSG_ZEROCOPY in the first place, but in case
>> someone else decides to use MSG_ZEROCOPY and wants the regular
>> semantics of a write, i.e. expecting an ENOBUFS on a socket error
>> queue pileup, they may choose not to pass the flag, rather than
>> letting QEMU do it unconditionally.
>
> This patch was previously addressing a functional bug in the current
> code. Satisfying any new use case should be a completely separate
> patch with justification in the commit message, as any single commit
> should only address one problem at a time.

Ack, I shall drop the flag in the next revision.
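
For reference, stripped of the QIOChannel plumbing, what the patch keeps
is just "on ENOBUFS from a zerocopy sendmsg, drain the socket error queue
once and retry". A rough standalone sketch of that flow with plain Linux
sockets (drain_zerocopy_errqueue() and send_zerocopy_flush_once() are
illustrative names, not code from the patch; the socket must already have
SO_ZEROCOPY enabled for MSG_ZEROCOPY to be accepted):

#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/*
 * Drain the zerocopy completion notifications queued on the socket
 * error queue, freeing the metadata skbs accounted against OPTMEM.
 */
static void drain_zerocopy_errqueue(int fd)
{
    char control[CMSG_SPACE(sizeof(struct sock_extended_err))];

    for (;;) {
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };

        /* -1 with EAGAIN means the error queue has been drained */
        if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0) {
            break;
        }
    }
}

/* Zerocopy send that flushes the error queue once, then retries on ENOBUFS. */
static ssize_t send_zerocopy_flush_once(int fd, const void *buf, size_t len)
{
    bool flushed = false;
    ssize_t ret;

retry:
    ret = send(fd, buf, len, MSG_ZEROCOPY);
    if (ret < 0 && errno == ENOBUFS && !flushed) {
        drain_zerocopy_errqueue(fd);
        flushed = true;
        goto retry;
    }
    return ret;
}

The patch itself does the equivalent inside qio_channel_socket_writev(),
going through qio_channel_socket_flush_internal() before the single retry,
but the shape is the same.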

regards,
tejus