On Wed, Nov 05, 2025 at 03:57:29PM -0600, Eric Blake wrote:
> On Tue, Nov 04, 2025 at 11:13:48AM +0000, Daniel P. Berrangé wrote:
> > On Mon, Nov 03, 2025 at 02:10:57PM -0600, Eric Blake wrote:
> > > The point of QIONetListener is to allow a server to listen to more
> > > than one socket address at a time, and respond to clients in a
> > > first-come-first-serve order across any of those addresses. While
> > > some servers (like NBD) really do want to serve multiple simultaneous
> > > clients, many other servers only care about the first client to
> > > connect, and will immediately deregister the callback, possibly by
> > > dropping their reference to the QIONetListener. The existing code
> > > ensures that all other pending callbacks remain safe once the first
> > > callback drops the listener, by adding an extra reference to the
> > > listener for each GSource created, where those references pair to the
> > > eventual teardown of each GSource after a given callbacks has been
> > > serviced or aborted. But it is equally acceptable to hoist the
> > > reference to the listener outside the loop - as long as there is a
> > > callback function registered, it is sufficient to have a single
> > > reference live for the entire array of sioc, rather than one reference
> > > per sioc in the array.
> > >
> > > Hoisting the reference like this will make it easier for an upcoming
> > > patch to still ensure the listener cannot be prematurely garbage
> > > collected during the user's callback, even when the callback no longer
> > > uses a per-sioc GSource.
> >
> > It isn't quite this simple. Glib reference counts the callback
> > func / data, holding a reference when dispatching the callback.
> >
> > IOW, even if the GSource is unrefed, the callback 'notify'
> > function won't be called if the main loop is in the process
> > of dispatching.
>
> I'm not sure I follow your argument. Glib holds a reference on the
> GSource object, not on the opaque data that is handed to the GSource.
> It is possible to use g_source_set_callback_indirect() where GSource
> can learn how to use the same reference counting on data as external
> code, by the use of function pointers for ref and unref, but QIO uses
> merely g_source_set_callback().
> https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gmain.c#L1844
> shows that glib then wraps that opaque pointer into an internal
> GSourceCallback object which itself is reference counted, so that the
> notify function is not called until the GSource is finalized, but that
> is reference counting on the container, not on the opaque object
> itself (which in this patch is the QIONetListener).
>
> >
> > With this change, the reference on 'listener' can now be
> > released even if the callback is currently dispatching.
>
> So if I'm understanding your concern, you're worried that the unwatch
> code can finish looping through the g_source_destroy and then reach
> the point where it unrefs listener, but that a late-breaking client
> connection can trigger a callback that can still be executing in
> another thread/coroutine after the listener is unref'ed but before the
> GSource has been finalized? If so, would squashing this in fix the
> problem you are seeing?
Consider the following scenario, where we have two threads, one
is calling QIONetListener APIs, and one is the event thread.
In the current code, when we unref() the GSource for the socket
watch, the destroy-notify does not get called, because the
event thread is in the middle of a dispatch callback for the
I/O event. When the dispatch callback returns control to the
event loop, the GSourceCallback is unrefed, and this triggers
the destroy-notify call, which unrefs the listener.
The flow looks like this:
Thread 1:
qio_net_listener_set_client_func(lstnr, f, ...);
=> foreach sock: socket
=> object_ref(lstnr)
=> sock_src = qio_channel_socket_add_watch_source(sock, ....,
lstnr, object_unref);
Thread 2:
poll()
=> event POLLIN on socket
=> ref(GSourceCallback)
=> call dispatch(sock)
...do stuff..
Thread 1:
qio_net_listener_set_client_func(lstnr, NULL, ...);
=> foreach sock: socket
=> g_source_unref(sock_src)
unref(lstnr) (the final reference)
=> finalize(lstnr)
Thread 2:
=> return dispatch(sock)
=> unref(GSourceCallback)
=> destroy-notify
=> object_unref
> diff --git i/io/net-listener.c w/io/net-listener.c
> index 9f4e3c0be0c..1fcbbeb7a76 100644
> --- i/io/net-listener.c
> +++ w/io/net-listener.c
> @@ -67,8 +67,10 @@ static void qio_net_listener_aio_func(void *opaque)
> {
> QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(opaque);
>
> + object_ref(OBJECT(sioc->listener));
> qio_net_listener_channel_func(QIO_CHANNEL(sioc), G_IO_IN,
> sioc->listener);
> + object_unref(OBJECT(sioc->listener));
> }
Now consider this patch, plus this extra hunk...
Thread 1:
qio_net_listener_set_client_func(lstnr, f, ...);
=> object_ref(listener)
=> foreach sock: socket
=> sock_src = qio_channel_socket_add_watch_source(sock, ....,
lstnr, NULL);
Thread 2:
poll()
=> event POLLIN on socket
=> call dispatch(sock)
=> ref(lstnr)
...do stuff..
Thread 1:
qio_net_listener_set_client_func(lstnr, NULL, ...);
=> foreach sock: socket
=> g_source_unref(sock_src)
=> object_unref(listener)
unref(lstnr) (still 1 reference left)
Thread 2:
=> unref(lstnr) (the final reference)
=> finalize(lstnr)
=> return dispatch(sock)
=> unref(GSourceCallback)
That appears to work ok, however, there's still a race window that is
not solved. Between the time thread 2 sees POLLIN, and when it calls
the dispatch(sock) function, it is possible that thread 1 will drop
the last reference:
Thread 1:
qio_net_listener_set_client_func(lstnr, f, ...);
=> object_ref(listener)
=> foreach sock: socket
=> sock_src = qio_channel_socket_add_watch_source(sock, ....,
lstnr, NULL);
Thread 2:
poll()
=> event POLLIN on socket
Thread 1:
qio_net_listener_set_client_func(lstnr, NULL, ...);
=> foreach sock: socket
=> g_source_unref(sock_src)
=> object_unref(listener)
unref(lstnr) (still 1 reference left)
Thread 2:
=> call dispatch(sock)
=> ref(lstnr)
...do stuff..
=> unref(lstnr) (the final reference)
=> finalize(lstnr)
=> return dispatch(sock)
=> unref(GSourceCallback)
I don't see a way to solve this without synchronization with the event
loop for releasing the reference on the opaque data for the dispatcher
callback. That's what the current code does, but I'm seeing no way for
the AioContext event loop callbacks to have anything equivalent. This
feels like a gap in the AioContext design.
This is admittedly an incredibly hard to trigger race condition. It would
need a client to be calling a QMP command that tears down the NBD server,
at the exact same time as a new NBD client was incoming. Or the same kind
of scenario for other pieces of QEMU code using QIONetListener. This still
makes me worried though, as rare races have a habit of hitting QEMU
eventually.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|