On 04/12/2018 12:43 PM, Richard W.M. Jones wrote:
> I'm fairly sure that these bugs which appear in the Python plugin:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1566516
> https://bugzilla.redhat.com/show_bug.cgi?id=1566522
>
> are really bugs in the SERIALIZE_ALL_REQUESTS thread model.  See
> the first patch for the full explanation.
>
> The second patch is a fix for a race condition which is probably
> nudged into being by the first patch.
>
> Now this all works, EXCEPT it quite reliably breaks either the NBD
> plugin or 'tests/test-parallel-nbd.sh'.  I have no idea why because as
> far as I know the change should not affect the NBD plugin or that test
> in any way.  Unfortunately both are rather complex to debug.
I'm debugging the testsuite breakage: the problem is a hang in parallel thread cleanup.

For example, in test-parallel-nbd.sh, we have two nbdkit processes (one running the 'nbd' plugin, the other running 'file').  The hang is in the nbdkit running 'file': it spawns multiple threads to process work requests, but then does pthread_join() on each thread in order.  Each worker thread competes to obtain read_lock prior to read()ing from the client.  So, unless we get lucky and the scheduler wakes the threads in the right order, pthread_join() is waiting for the wrong thread (one that is blocked on obtaining read_lock), while the thread that DOES hold read_lock is blocked in recv(), waiting for the client to send another command or to indicate EOF.

Why did your patch expose it?  Pre-patch, the nbdkit running 'nbd' closed the socket early, without the locks; the nbdkit running 'file' then gets a return of -1 from recv(), each worker thread quickly exits, and the while loop joining threads completes.  With your patch, the nbdkit running 'nbd' does not close the socket until unload_prevention_lock gives the all clear; the nbdkit running 'file' is thus hung waiting on recv(), even though it already knows that the client will be shutting down soon.

But there's also a problem with the nbdkit running 'nbd': it tries to pthread_join() the reader thread (to see if the server has any remaining replies).  Pre-patch, there was no problem joining the thread; post-patch, both nbdkit processes are stuck in a read() waiting on the other process.

I'm still trying to come up with an appropriate patch (or maybe even two patches).  I think it is desirable to continue read()ing from the client even after the client has requested shutdown (NBD_CMD_DISC), if only to gracefully reply with an error message to all subsequent commands erroneously sent by an ill-behaved client.
But it also seems fairly obvious that once we are in shutdown mode, parallel processing is no longer important, and that it is pointless to have threads compete on the read lock when a well-behaved client will not be sending any more data, blocking the worker threads on a read() that won't complete until the client actually hangs up.  So the nbdkit running 'file' needs to be more robust (perhaps, once we have detected that the client has requested shutdown, limit all further read()s to worker thread 0, so that all other threads get reaped quickly); but the nbdkit running 'nbd' also needs to be fixed to not keep the socket around forever.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
_______________________________________________
Libguestfs mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/libguestfs
