On Tue, Nov 11, 2025 at 01:09:24PM -0600, Eric Blake wrote:
> > In the current code, when we unref() the GSource for the socket
> > watch, the destroy-notify does not get called, because the
> > event thread is in the middle of a dispatch callback for the
> > I/O event.  When the dispatch callback returns control to the
> > event loop, the GSourceCallback is unrefed, and this triggers
> > the destroy-notify call, which unrefs the listener.
> > 
> > The flow looks like this:
> > 
> >   Thread 1:
> >        qio_net_listener_set_client_func(lstnr, f, ...);
> >            => foreach sock: socket
> >                => object_ref(lstnr)
> >                => sock_src = qio_channel_socket_add_watch_source(
> >                       sock, ...., lstnr, object_unref);
> > 
> >   Thread 2:
> >        poll()
> >           => event POLLIN on socket
> >                => ref(GSourceCallback)
> >                => call dispatch(sock)
> >                     ...do stuff..
> > 
> >   Thread 1:
> >        qio_net_listener_set_client_func(lstnr, NULL, ...);
> >           => foreach sock: socket
> >                => g_source_unref(sock_src)
> >        unref(lstnr)  (the final reference)
> >           => finalize(lstnr)
> 
> If I understand correctly, _this_ unref(lstnr) is NOT the final
> reference, because of the additional reference still owned by the
> GSource that is still dispatching, so finalize(lstnr) is not reached
> here...
> 
> > 
> >   Thread 2:
> >               => return dispatch(sock)
> >               => unref(GSourceCallback)
> >                   => destroy-notify
> >                      => object_unref
> 
> ...but instead here.  And that is the desirable property of the
> pre-patch behavior with a per-GSource reference on the listener
> object - we are guaranteed that the listener can't be finalized while
> any dispatch is still in flight, even if the caller has unref'd their
> last mention of lstnr, and therefore the dispatch never has a
> use-after-free.
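
(Concretely, the pre-patch registration pattern is roughly the
following; a sketch using the names from the diagram above, with the
argument list elided as before, not verbatim tree code:)

    /* Take a reference on the listener for the watch source, and hand
     * object_unref to GLib as the destroy notify.  GLib fires the
     * notify only after the callback's own reference is dropped, i.e.
     * once any in-flight dispatch has returned, so the listener cannot
     * be finalized mid-dispatch. */
    object_ref(lstnr);
    sock_src = qio_channel_socket_add_watch_source(sock, ....,
                                                   lstnr, object_unref);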

But thinking further: even if what races with the dispatch is not an
object_unref but merely a change of the async callback function, is
this the sort of thing where a mutex is needed to make things safer?

Even without my patch series, we have this scenario:

  Thread 1:
       qio_net_listener_set_client_func(lstnr, f1, ...);
           => foreach sock: socket
               => object_ref(lstnr)
               => sock_src = qio_channel_socket_add_watch_source(
                      sock, ...., lstnr, object_unref);

  Thread 2:
       poll()
          => event POLLIN on socket
               => ref(GSourceCallback) // while lstnr->io_func is f1
                    ...do stuff..

  Thread 1:
       qio_net_listener_set_client_func(lstnr, f2, ...);
          => foreach sock: socket
               => g_source_unref(sock_src)
          => foreach sock: socket
               => object_ref(lstnr)
               => sock_src = qio_channel_socket_add_watch_source(
                      sock, ...., lstnr, object_unref);

  Thread 2:
               => access lstnr->io_func // now sees f2
               => call f2(sock)
               => return dispatch(sock)
               => unref(GSourceCallback)
                  => destroy-notify
                     => object_unref

So even though thread 2 noticed a client while f1 was registered,
because we lack a mutex and are not using atomic access primitives,
there is nothing preventing thread 1 from changing io_func from f1 to
f2 before thread 2 finally invokes the callback.  And if f2 is NULL
(because the caller is deregistering the callback), I could see the
race leaving qio_net_listener_channel_func(), with its pre-patch code
of:

    if (listener->io_func) {
        listener->io_func(listener, sioc, listener->io_data);
    }

dereferencing a NULL pointer if thread 1 cleared listener->io_func
after the if but before the call.
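
(Snapshotting the fields into locals before the NULL check would close
that particular window, though the two plain loads could still be torn
apart from each other - f1 paired with f2's data - which is where the
mutex idea below comes in.  A hedged sketch of the dispatch body, not
verbatim tree code:)

    /* Sample the fields once, so a concurrent
     * qio_net_listener_set_client_func() cannot yank io_func to NULL
     * between the test and the call. */
    QIONetListenerClientFunc io_func = listener->io_func;
    gpointer io_data = listener->io_data;

    if (io_func) {
        io_func(listener, sioc, io_data);
    }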

Adding a mutex might make QIONetListener slower by the time it takes
to grab the mutex (even if it is uncontended most of the time), but
with a proper mutex in place we could at least guarantee that
listener->io_func, listener->io_data, and listener->io_notify are
changed as a group, rather than observed piecemeal when the polling
thread is in the middle of a dispatch with listener as its opaque data
at the same time the user's thread is changing the registered
callback.
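
(Something along these lines, with a hypothetical new listener->lock
field of type QemuMutex; a sketch of the idea, not a tested patch:)

    /* In qio_net_listener_set_client_func(): swap the trio as a group. */
    qemu_mutex_lock(&listener->lock);
    listener->io_func = func;
    listener->io_data = data;
    listener->io_notify = notify;
    qemu_mutex_unlock(&listener->lock);

    /* In the dispatch path: sample under the same lock, but drop it
     * before invoking the callback, so a callback that re-registers
     * itself cannot deadlock against the lock. */
    qemu_mutex_lock(&listener->lock);
    io_func = listener->io_func;
    io_data = listener->io_data;
    qemu_mutex_unlock(&listener->lock);
    if (io_func) {
        io_func(listener, sioc, io_data);
    }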

By itself, GSource DOES have a mutex lock around the GMainContext it
is attached to; but our particular use of GSource is operating outside
that locking scheme, so it looks like I've uncovered another latent
problem.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org

