> On 9 Oct 2025, at 9:40 PM, Daniel P. Berrangé <[email protected]> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Thu, Oct 09, 2025 at 11:51:51AM +0000, Tejus GK wrote:
>> 
>> 
>>> On 9 Oct 2025, at 4:02 PM, Daniel P. Berrangé <[email protected]> wrote:
>>> 
>>> !-------------------------------------------------------------------|
>>> CAUTION: External Email
>>> 
>>> |-------------------------------------------------------------------!
>>> 
>>> On Thu, Oct 09, 2025 at 10:14:18AM +0000, Tejus GK wrote:
>>>> From: Manish Mishra <[email protected]>
>>>> 
>>>> The kernel allocates extra metadata SKBs in case of a zerocopy send,
>>>> eventually used for zerocopy's notification mechanism. This metadata
>>>> memory is accounted for in the OPTMEM limit. The kernel queues
>>>> completion notifications on the socket error queue and this error queue
>>>> is freed when userspace reads it.
>>>> 
>>>> Usually, in the case of in-order processing, the kernel will batch the
>>>> notifications and merge the metadata into a single SKB and free the
>>>> rest. As a result, it never exceeds the OPTMEM limit. However, if there
>>>> is any out-of-order processing or intermittent zerocopy failures, this
>>>> error chain can grow significantly, exhausting the OPTMEM limit. As a
>>>> result, all new sendmsg requests fail to allocate any new SKB, leading
>>>> to an ENOBUF error. Depending on the amount of data queued before the
>>>> flush (i.e., large live migration iterations), even large OPTMEM limits
>>>> are prone to failure.
>>>> 
>>>> To work around this, if we encounter an ENOBUF error with a zerocopy
>>>> sendmsg, flush the error queue and retry once more.
>>>> 
>>>> Co-authored-by: Manish Mishra <[email protected]>
>>>> Signed-off-by: Tejus GK <[email protected]>
>>>> ---
>>>> include/io/channel-socket.h |  5 +++
>>>> io/channel-socket.c         | 78 ++++++++++++++++++++++++++++++-------
>>>> migration/multifd-nocomp.c  |  3 +-
>>>> migration/multifd.c         |  3 +-
>>>> 4 files changed, 72 insertions(+), 17 deletions(-)
>>>> 
>>>> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
>>>> index 26319fa98b..fcfd489c6c 100644
>>>> --- a/include/io/channel-socket.h
>>>> +++ b/include/io/channel-socket.h
>>>> @@ -50,6 +50,11 @@ struct QIOChannelSocket {
>>>>    ssize_t zero_copy_queued;
>>>>    ssize_t zero_copy_sent;
>>>>    bool blocking;
>>>> +    /**
>>>> +     * This flag indicates whether any new data was successfully sent with
>>>> +     * zerocopy since the last qio_channel_socket_flush() call.
>>>> +     */
>>>> +    bool new_zero_copy_sent_success;
>>>> };
>>>> 
>>>> 
>>>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>>>> index 8b30d5b7f7..22d59cd3cf 100644
>>>> --- a/io/channel-socket.c
>>>> +++ b/io/channel-socket.c
>>>> @@ -37,6 +37,12 @@
>>>> 
>>>> #define SOCKET_MAX_FDS 16
>>>> 
>>>> +#ifdef QEMU_MSG_ZEROCOPY
>>>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>> +                                             bool block,
>>>> +                                             Error **errp);
>>>> +#endif
>>>> +
>>>> SocketAddress *
>>>> qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>>>>                                     Error **errp)
>>>> @@ -66,6 +72,7 @@ qio_channel_socket_new(void)
>>>>    sioc->zero_copy_queued = 0;
>>>>    sioc->zero_copy_sent = 0;
>>>>    sioc->blocking = false;
>>>> +    sioc->new_zero_copy_sent_success = FALSE;
>>>> 
>>>>    ioc = QIO_CHANNEL(sioc);
>>>>    qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
>>>> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
>>>> *ioc,
>>>>    size_t fdsize = sizeof(int) * nfds;
>>>>    struct cmsghdr *cmsg;
>>>>    int sflags = 0;
>>>> +    bool blocking = sioc->blocking;
>>>> +    bool zerocopy_flushed_once = false;
>>>> 
>>>>    memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>>>> 
>>>> @@ -663,10 +672,26 @@ static ssize_t qio_channel_socket_writev(QIOChannel 
>>>> *ioc,
>>>>        case EINTR:
>>>>            goto retry;
>>>>        case ENOBUFS:
>>>> -            if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
>>>> -                error_setg_errno(errp, errno,
>>>> -                                 "Process can't lock enough memory for 
>>>> using MSG_ZEROCOPY");
>>>> -                return -1;
>>>> +            if (flags & (QIO_CHANNEL_WRITE_FLAG_ZERO_COPY |
>>>> +                QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE)) {
>>> 
>>> 
>>> There are only two callers where this patch has added use of
>>> QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE.
>>> 
>>> Both of them already sett QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, so
>>> this code branch was already being taken
>>> 
>>> IOW, this new QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE looks
>>> pointless.
>> 
>> The intention on adding QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE
>> was to let callers decide if they want QEMU to silently flush
>> and not throw an ENOBUFS.
> 
> This doesn't make any sense. sendmsg man page says that ENOBUFS
> is never returned on Linux ordinarily. So AFAICT, the only time
> we'll see ENOBUFS is if we used MSG_ZEROCOPY in the sendmsg
> input. The MSG_ZEROCOPY flag for sendmsg is only set if we have
> QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag set in the qio_channel_write
> call. So again QIO_CHANNEL_WRITE_FLAG_ZERO_COPY_FLUSH_ONCE appears
> to serve no functional purpose.
> 
>> Right now it fits for the migration use-case, since migration is
>> the only caller using MSG_ZEROCOPY in the first place, but in case
>> someone else decides to use MSG_ZEROCOPY, and wants the regular
>> semantics of a write, i.e, expecting an ENOBUFS on a socket error
>> queue pileup, they may choose not to pass the flag, rather than
>> letting QEMU doing it unconditionally.
> 
> This patch was previously addressing a functional bug in the current
> code.  Satisfying any new use case should be a completly separate
> patch with justification in the commit message, as any single commit
> should only address 1 problem at a time.

Ack, I shall drop the flag in the next revision. 


regards,
tejus

Reply via email to