> On 9 Oct 2025, at 4:02 PM, Daniel P. Berrangé <[email protected]> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Thu, Oct 09, 2025 at 10:14:18AM +0000, Tejus GK wrote:
>> From: Manish Mishra <[email protected]>
>> 
>> The kernel allocates extra metadata SKBs in case of a zerocopy send,
>> eventually used for zerocopy's notification mechanism. This metadata
>> memory is accounted for in the OPTMEM limit. The kernel queues
>> completion notifications on the socket error queue and this error queue
>> is freed when userspace reads it.
>> 
>> Usually, in the case of in-order processing, the kernel will batch the
>> notifications and merge the metadata into a single SKB and free the
>> rest. As a result, it never exceeds the OPTMEM limit. However, if there
>> is any out-of-order processing or intermittent zerocopy failures, this
>> error chain can grow significantly, exhausting the OPTMEM limit. As a
>> result, all new sendmsg requests fail to allocate any new SKB, leading
>> to an ENOBUF error. Depending on the amount of data queued before the
>> flush (i.e., large live migration iterations), even large OPTMEM limits
>> are prone to failure.
>> 
>> To work around this, if we encounter an ENOBUF error with a zerocopy
>> sendmsg, flush the error queue and retry once more.
>> 
>> Co-authored-by: Manish Mishra <[email protected]>
>> Signed-off-by: Tejus GK <[email protected]>
>> ---
>> include/io/channel-socket.h |  5 +++
>> io/channel-socket.c         | 78 ++++++++++++++++++++++++++++++-------
>> migration/multifd-nocomp.c  |  3 +-
>> migration/multifd.c         |  3 +-
>> 4 files changed, 72 insertions(+), 17 deletions(-)
>> 
>> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
>> index 26319fa98b..fcfd489c6c 100644
>> --- a/include/io/channel-socket.h
>> +++ b/include/io/channel-socket.h
>> @@ -50,6 +50,11 @@ struct QIOChannelSocket {
>>     ssize_t zero_copy_queued;
>>     ssize_t zero_copy_sent;
>>     bool blocking;
>> +    /**
>> +     * This flag indicates whether any new data was successfully sent with
>> +     * zerocopy since the last qio_channel_socket_flush() call.
>> +     */
>> +    bool new_zero_copy_sent_success;
>> };
>> 
>> 
>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>> index 8b30d5b7f7..22d59cd3cf 100644
>> --- a/io/channel-socket.c
>> +++ b/io/channel-socket.c
>> @@ -37,6 +37,12 @@
>> 
>> #define SOCKET_MAX_FDS 16
>> 
>> +#ifdef QEMU_MSG_ZEROCOPY
>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>> +                                             bool block,
>> +                                             Error **errp);
>> +#endif
>> +
>> SocketAddress *
>> qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>>                                      Error **errp)
>> @@ -66,6 +72,7 @@ qio_channel_socket_new(void)
>>     sioc->zero_copy_queued = 0;
>>     sioc->zero_copy_sent = 0;
>>     sioc->blocking = false;
>> +    sioc->new_zero_copy_sent_success = FALSE;
>> 
>>     ioc = QIO_CHANNEL(sioc);
>>     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
>> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>     size_t fdsize = sizeof(int) * nfds;
>>     struct cmsghdr *cmsg;
>>     int sflags = 0;
>> +    bool blocking = sioc->blocking;
>> +    bool zerocopy_flushed_once = false;
>> 
>>     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>> 
>> @@ -663,10 +672,26 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
>> *ioc,
>>         case EINTR:
>>             goto retry;
>>         case ENOBUFS:
>> -            if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
>> -                error_setg_errno(errp, errno,
>> -                                 "Process can't lock enough memory for 
>> using MSG_ZEROCOPY");
>> -                return -1;
>> +            if (flags & (QIO_CHANNEL_WRITE_FLAG_ZERO_COPY |
>> +                QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE)) {
> 
> 
> There are only two callers where this patch has added use of
> QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE.
> 
> Both of them already sett QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, so
> this code branch was already being taken
> 
> IOW, this new QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE looks
> pointless.
> 

The intention on adding QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE was to let 
callers decide if they want QEMU to silently flush and not throw an ENOBUFS.
Right now it fits for the migration use-case, since migration is the only 
caller using MSG_ZEROCOPY in the first place, but in case someone else decides 
to use MSG_ZEROCOPY, and wants the regular semantics of a write, i.e, expecting 
an ENOBUFS on a socket error queue pileup, they may choose not to pass the 
flag, rather than letting QEMU doing it unconditionally.

However, I might be overthinking this, if the usecase which I mention above 
doesn't make sense, I can drop this flag.


regards,
tejus

Reply via email to