> On 24 Oct 2025, at 5:49 PM, Daniel P. Berrangé <[email protected]> wrote:
> 
> On Tue, Oct 21, 2025 at 04:16:48PM +0000, Tejus GK wrote:
>> 
>> 
>>> On 13 Oct 2025, at 3:08 PM, Daniel P. Berrangé <[email protected]> wrote:
>>> 
>>> On Mon, Oct 13, 2025 at 09:21:22AM +0000, Tejus GK wrote:
>>>> From: Manish Mishra <[email protected]>
>>>> 
>>>> The kernel allocates extra metadata SKBs in case of a zerocopy send,
>>>> eventually used for zerocopy's notification mechanism. This metadata
>>>> memory is accounted for in the OPTMEM limit. The kernel queues
>>>> completion notifications on the socket error queue and this error queue
>>>> is freed when userspace reads it.
>>>> 
>>>> Usually, in the case of in-order processing, the kernel will batch the
>>>> notifications and merge the metadata into a single SKB and free the
>>>> rest. As a result, it never exceeds the OPTMEM limit. However, if there
>>>> is any out-of-order processing or intermittent zerocopy failures, this
>>>> error chain can grow significantly, exhausting the OPTMEM limit. As a
>>>> result, all new sendmsg requests fail to allocate any new SKB, leading
>>>> to an ENOBUF error. Depending on the amount of data queued before the
>>>> flush (i.e., large live migration iterations), even large OPTMEM limits
>>>> are prone to failure.
>>>> 
>>>> To work around this, if we encounter an ENOBUF error with a zerocopy
>>>> sendmsg, flush the error queue and retry once more.
>>>> 
>>>> Co-authored-by: Manish Mishra <[email protected]>
>>>> Signed-off-by: Tejus GK <[email protected]>
>>>> ---
>>>> include/io/channel-socket.h |  5 +++
>>>> io/channel-socket.c         | 75 ++++++++++++++++++++++++++++++-------
>>>> 2 files changed, 66 insertions(+), 14 deletions(-)
>>>> 
>>>> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
>>>> index 26319fa98b..fcfd489c6c 100644
>>>> --- a/include/io/channel-socket.h
>>>> +++ b/include/io/channel-socket.h
>>>> @@ -50,6 +50,11 @@ struct QIOChannelSocket {
>>>>    ssize_t zero_copy_queued;
>>>>    ssize_t zero_copy_sent;
>>>>    bool blocking;
>>>> +    /**
>>>> +     * This flag indicates whether any new data was successfully sent with
>>>> +     * zerocopy since the last qio_channel_socket_flush() call.
>>>> +     */
>>>> +    bool new_zero_copy_sent_success;
>>>> };
>>>> 
>>>> 
>>>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>>>> index 8b30d5b7f7..7cd9f3666d 100644
>>>> --- a/io/channel-socket.c
>>>> +++ b/io/channel-socket.c
>>>> @@ -37,6 +37,12 @@
>>>> 
>>>> #define SOCKET_MAX_FDS 16
>>>> 
>>>> +#ifdef QEMU_MSG_ZEROCOPY
>>>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>> +                                             bool block,
>>>> +                                             Error **errp);
>>>> +#endif
>>>> +
>>>> SocketAddress *
>>>> qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>>>>                                     Error **errp)
>>>> @@ -66,6 +72,7 @@ qio_channel_socket_new(void)
>>>>    sioc->zero_copy_queued = 0;
>>>>    sioc->zero_copy_sent = 0;
>>>>    sioc->blocking = false;
>>>> +    sioc->new_zero_copy_sent_success = false;
>>>> 
>>>>    ioc = QIO_CHANNEL(sioc);
>>>>    qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
>>>> @@ -618,6 +625,8 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>>    size_t fdsize = sizeof(int) * nfds;
>>>>    struct cmsghdr *cmsg;
>>>>    int sflags = 0;
>>>> +    bool blocking = sioc->blocking;
>>>> +    bool zerocopy_flushed_once = false;
>>>> 
>>>>    memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>>>> 
>>>> @@ -664,9 +673,24 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>>            goto retry;
>>>>        case ENOBUFS:
>>>>            if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
>>>> -                error_setg_errno(errp, errno,
>>>> -                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
>>>> -                return -1;
>>>> +                /**
>>>> +                 * Socket error queueing may exhaust the OPTMEM limit. Try
>>>> +                 * flushing the error queue once.
>>>> +                 */
>>>> +                if (!zerocopy_flushed_once) {
>>>> +                    ret = qio_channel_socket_flush_internal(ioc, blocking,
>>>> +                                                            errp);
>>>> +                    if (ret < 0) {
>>>> +                        return -1;
>>>> +                    }
>>>> +                    zerocopy_flushed_once = true;
>>>> +                    goto retry;
>>>> +                } else {
>>>> +                    error_setg_errno(errp, errno,
>>>> +                                     "Process can't lock enough memory for "
>>>> +                                     "using MSG_ZEROCOPY");
>>>> +                    return -1;
>>>> +                }
>>>>            }
>>>>            break;
>>>>        }
>>>> @@ -777,8 +801,9 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>>>> 
>>>> 
>>>> #ifdef QEMU_MSG_ZEROCOPY
>>>> -static int qio_channel_socket_flush(QIOChannel *ioc,
>>>> -                                    Error **errp)
>>>> +static int qio_channel_socket_flush_internal(QIOChannel *ioc,
>>>> +                                             bool block,
>>>> +                                             Error **errp)
>>>> {
>>>>    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
>>>>    struct msghdr msg = {};
>>>> @@ -786,7 +811,6 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>>>>    struct cmsghdr *cm;
>>>>    char control[CMSG_SPACE(sizeof(*serr))];
>>>>    int received;
>>>> -    int ret;
>>>> 
>>>>    if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
>>>>        return 0;
>>>> @@ -796,16 +820,20 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
>>>>    msg.msg_controllen = sizeof(control);
>>>>    memset(control, 0, sizeof(control));
>>>> 
>>>> -    ret = 1;
>>>> -
>>>>    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
>>>>        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
>>>>        if (received < 0) {
>>>>            switch (errno) {
>>>>            case EAGAIN:
>>>> -                /* Nothing on errqueue, wait until something is available */
>>>> -                qio_channel_wait(ioc, G_IO_ERR);
>>>> -                continue;
>>>> +                if (block) {
>>>> +                    /*
>>>> +                     * Nothing on errqueue, wait until something is
>>>> +                     * available.
>>>> +                     */
>>>> +                    qio_channel_wait(ioc, G_IO_ERR);
>>>> +                    continue;
>>> 
>>> Why G_IO_ERR ?  If we're waiting for recvmsg() to become ready, then
>>> it would need to be G_IO_IN we're waiting for.
>> 
>> Apologies for the delayed response. Please correct me if I am wrong, but
>> https://docs.kernel.org/networking/msg_zerocopy.html#notification-reception
>> mentions that, in order to poll for notifications on the socket error
>> queue, one need not set any flag in the events field; the kernel will
>> instead set POLLERR in the output when a message eventually arrives on
>> the notification queue.
>> From what I understand, the glib equivalent of POLLERR is G_IO_ERR, which
>> means we’d be waiting on the socket error queue until a notification
>> comes up.
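>> 
>> As a quick illustration of what that section describes (a standalone
>> sketch, not code from this patch; the function name is made up):
>> 
>> #include <poll.h>
>> #include <sys/socket.h>
>> #include <linux/errqueue.h>
>> 
>> /* Wait for a MSG_ZEROCOPY completion notification on a connected
>>  * socket 'sockfd' that has zerocopy sends in flight, then drain one
>>  * notification from the socket error queue. */
>> static void wait_for_zerocopy_notification(int sockfd)
>> {
>>     /* events is deliberately left empty: errqueue readiness is
>>      * reported via POLLERR in revents, not via POLLIN. */
>>     struct pollfd pfd = { .fd = sockfd, .events = 0 };
>>     char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
>>     struct msghdr msg = {
>>         .msg_control = control,
>>         .msg_controllen = sizeof(control),
>>     };
>> 
>>     if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLERR)) {
>>         recvmsg(sockfd, &msg, MSG_ERRQUEUE);
>>     }
>> }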
> 
> Ah, I see. That's rather non-obvious. Can you put a comment in the code
> to the effect that use of MSG_ERRQUEUE requires you to poll on G_IO_ERR
> instead of the normal G_IO_IN?

Ack, I’ll add a comment in the next patch series. Thanks!
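
Roughly along these lines (exact wording still to be settled in the next
revision):

    case EAGAIN:
        if (block) {
            /*
             * Nothing on errqueue yet. MSG_ERRQUEUE completion
             * notifications are signalled via POLLERR (G_IO_ERR),
             * not the usual G_IO_IN, so wait on G_IO_ERR here.
             */
            qio_channel_wait(ioc, G_IO_ERR);
            continue;
        }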


regards,
tejus
