> On 24 Oct 2025, at 5:49 PM, Daniel P. Berrangé <[email protected]> wrote:
> 
> On Tue, Oct 21, 2025 at 04:16:48PM +0000, Tejus GK wrote:
>> 
>> 
>>> On 13 Oct 2025, at 3:08 PM, Daniel P. Berrangé <[email protected]> wrote:
>>> 
>>> On Mon, Oct 13, 2025 at 09:21:22AM +0000, Tejus GK wrote:
>>>> From: Manish Mishra <[email protected]>
>>>> 
>>>> The kernel allocates extra metadata SKBs in case of a zerocopy send,
>>>> eventually used for zerocopy's notification mechanism. This metadata
>>>> memory is accounted for in the OPTMEM limit. The kernel queues
>>>> completion notifications on the socket error queue, and this error
>>>> queue is freed when userspace reads it.
>>>> 
>>>> Usually, in the case of in-order processing, the kernel will batch the
>>>> notifications, merging the metadata into a single SKB and freeing the
>>>> rest, so the OPTMEM limit is never exceeded. However, if there is any
>>>> out-of-order processing or intermittent zerocopy failures, this error
>>>> chain can grow significantly, exhausting the OPTMEM limit. As a
>>>> result, all new sendmsg requests fail to allocate any new SKB, leading
>>>> to an ENOBUFS error. Depending on the amount of data queued before the
>>>> flush (e.g., large live migration iterations), even large OPTMEM
>>>> limits are prone to failure.
>>>> 
>>>> To work around this, if we encounter an ENOBUFS error with a zerocopy
>>>> sendmsg, flush the error queue and retry once more.
>>>> 
>>>> Co-authored-by: Manish Mishra <[email protected]>
>>>> Signed-off-by: Tejus GK <[email protected]>
>>>> ---
>>>> include/io/channel-socket.h |  5 +++
>>>> io/channel-socket.c         | 75 ++++++++++++++++++++++++++++++-------
>>>> 2 files changed, 66 insertions(+), 14 deletions(-)
>>>> 
>>>> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
>>>> index 26319fa98b..fcfd489c6c 100644
>>>> --- a/include/io/channel-socket.h
>>>> +++ b/include/io/channel-socket.h
>>>> @@ -50,6 +50,11 @@ struct QIOChannelSocket {
>>>>      ssize_t zero_copy_queued;
>>>>      ssize_t zero_copy_sent;
>>>>      bool blocking;
>>>> +    /**
>>>> +     * This flag indicates whether any new data was successfully sent with
>>>> +     * zerocopy since the last qio_channel_socket_flush() call.
>>>> +     */
>>>> +    bool new_zero_copy_sent_success;
>>>>  };
>>>> 
>>>> 
>>>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>>>> index 8b30d5b7f7..7cd9f3666d 100644
>>>> --- a/io/channel-socket.c
>>>> +++ b/io/channel-socket.c
>>>> @@ -37,6 +37,12 @@
>>>> 
>>>>  #define SOCKET_MAX_FDS 16
>>>> 
>>>> +#ifdef QEMU_MSG_ZEROCOPY
>>>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>> +                                             bool block,
>>>> +                                             Error **errp);
>>>> +#endif
>>>> +
>>>>  SocketAddress *
>>>>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>>>>                                       Error **errp)
>>>> @@ -66,6 +72,7 @@ qio_channel_socket_new(void)
>>>>      sioc->zero_copy_queued = 0;
>>>>      sioc->zero_copy_sent = 0;
>>>>      sioc->blocking = false;
>>>> +    sioc->new_zero_copy_sent_success = false;
>>>> 
>>>>      ioc = QIO_CHANNEL(sioc);
>>>>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
>>>> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>>      size_t fdsize = sizeof(int) * nfds;
>>>>      struct cmsghdr *cmsg;
>>>>      int sflags = 0;
>>>> +    bool blocking = sioc->blocking;
>>>> +    bool zerocopy_flushed_once = false;
>>>> 
>>>>      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>>>> 
>>>> @@ -664,9 +673,24 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>>              goto retry;
>>>>          case ENOBUFS:
>>>>              if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
>>>> -                error_setg_errno(errp, errno,
>>>> -                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
>>>> -                return -1;
>>>> +                /**
>>>> +                 * Socket error queueing may exhaust the OPTMEM limit. Try
>>>> +                 * flushing the error queue once.
>>>> +                 */
>>>> +                if (!zerocopy_flushed_once) {
>>>> +                    ret = qio_channel_socket_flush_internal(ioc, blocking,
>>>> +                                                            errp);
>>>> +                    if (ret < 0) {
>>>> +                        return -1;
>>>> +                    }
>>>> +                    zerocopy_flushed_once = true;
>>>> +                    goto retry;
>>>> +                } else {
>>>> +                    error_setg_errno(errp, errno,
>>>> +                                     "Process can't lock enough memory for "
>>>> +                                     "using MSG_ZEROCOPY");
>>>> +                    return -1;
>>>> +                }
>>>>              }
>>>>              break;
>>>>          }
>>>> @@ -777,8 +801,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>> 
>>>> 
>>>>  #ifdef QEMU_MSG_ZEROCOPY
>>>> -static int qio_channel_socket_flush(QIOChannel *ioc,
>>>> -                                    Error **errp)
>>>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>> +                                             bool block,
>>>> +                                             Error **errp)
>>>>  {
>>>>      QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
>>>>      struct msghdr msg = {};
>>>> @@ -786,7 +811,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>>>>      struct cmsghdr *cm;
>>>>      char control[CMSG_SPACE(sizeof(*serr))];
>>>>      int received;
>>>> -    int ret;
>>>> 
>>>>      if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
>>>>          return 0;
>>>> @@ -796,16 +820,20 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>>>>      msg.msg_controllen = sizeof(control);
>>>>      memset(control, 0, sizeof(control));
>>>> 
>>>> -    ret = 1;
>>>> -
>>>>      while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
>>>>          received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
>>>>          if (received < 0) {
>>>>              switch (errno) {
>>>>              case EAGAIN:
>>>> -                /* Nothing on errqueue, wait until something is available */
>>>> -                qio_channel_wait(ioc, G_IO_ERR);
>>>> -                continue;
>>>> +                if (block) {
>>>> +                    /*
>>>> +                     * Nothing on errqueue, wait until something is
>>>> +                     * available.
>>>> +                     */
>>>> +                    qio_channel_wait(ioc, G_IO_ERR);
>>>> +                    continue;
>>> 
>>> Why G_IO_ERR? If we're waiting for recvmsg() to become ready, then
>>> it would need to be G_IO_IN we're waiting for.
>> 
>> Apologies for the delayed response.
>> Please correct me if I am wrong, but
>> https://docs.kernel.org/networking/msg_zerocopy.html#notification-reception
>> mentions that in order to poll for notifications on the socket error
>> queue, one need not set any flag in the events field; the kernel will
>> set POLLERR in the output when there is eventually a message in the
>> notification queue.
>> From what I understand, the glib equivalent of POLLERR is G_IO_ERR,
>> which means we'd be waiting on the socket error queue until a
>> notification comes up.
> 
> Ah I see. That's rather non-obvious. Can you put a comment in the code
> to the effect that use of MSG_ERRQUEUE requires you to poll on G_IO_ERR
> instead of the normal G_IO_IN.
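In raw poll() terms, what that kernel document describes boils down to
roughly the following standalone sketch (a hypothetical illustration, not
the QEMU code path; the function name is invented and error handling is
trimmed):

    #include <poll.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /*
     * Block until a zerocopy completion notification is pending on fd's
     * error queue, then drain one notification.
     */
    static ssize_t wait_for_zerocopy_notification(int fd)
    {
        char control[128];
        struct pollfd pfd = { .fd = fd, .events = 0 };
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };

        /*
         * No events requested: the kernel reports pending error-queue
         * messages via POLLERR in revents regardless of what .events
         * asks for.
         */
        if (poll(&pfd, 1, -1) < 0) {
            return -1;
        }
        if (pfd.revents & POLLERR) {
            /* MSG_ERRQUEUE reads notifications, not socket data. */
            return recvmsg(fd, &msg, MSG_ERRQUEUE);
        }
        return 0;
    }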
Ack, I'll add a comment in the next patch series, something along the
lines of the draft below.
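Draft only, exact wording to be settled in the next revision; it would sit
on the existing wait in qio_channel_socket_flush_internal():

    /*
     * Nothing on errqueue yet. The kernel signals pending MSG_ERRQUEUE
     * notifications via POLLERR (G_IO_ERR in glib terms), not via the
     * usual G_IO_IN, so G_IO_ERR is the condition to wait on here.
     */
    qio_channel_wait(ioc, G_IO_ERR);
    continue;

Thanks!

regards,
tejus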
